Setup

library(cmdstanr)
library(dplyr)
library(lubridate)
library(ggplot2)
library(bayesplot)
library(posterior)

theme_set(bayesplot::theme_default())

# seed for R's pseudo-RNGs, not Stan's
set.seed(1123)

# load data
pest_data <- readRDS("data/pest_data.RDS")
standata_hier <- readRDS("data/standata_hier.RDS")

The problem

Background

Imagine that you are a statistician or data scientist working as an independent contractor. One of your clients is a company that owns many residential buildings throughout New York City. The property manager explains that they are concerned about the number of cockroach complaints that they receive from their buildings. Previously, the company offered ad-hoc visits from a pest inspector as a solution to this problem. While this is the default solution of many property managers in NYC, the tenants are rarely home when the inspector visits, and so the manager reasons that this is a relatively expensive solution that is currently not very effective.

One alternative is to deploy long-term bait stations. In this approach, child- and pet-safe bait stations are installed throughout the apartment building. Cockroaches obtain quick-acting poison from these stations and distribute it throughout the colony. The manufacturer of these bait stations provides some indication of the space-to-bait efficacy, but the manager suspects that this guidance was not calculated with NYC roaches in mind. NYC roaches, the manager rationalizes, have more hustle than traditional roaches, and NYC buildings are built differently than other common residential buildings in the US. This is particularly important because the unit cost of each bait station per year is quite high.

The goal

The manager wishes to employ your services to help them find the optimal number of roach bait stations to place in each of their buildings in order to minimize the number of cockroach complaints while also keeping expenditure on pest control affordable.

A subset of the company’s buildings have been randomly selected for an experiment:

  • At the beginning of each month, a pest inspector randomly places a number of bait stations throughout the building, without knowledge of the current cockroach levels in the building.
  • At the end of the month, the manager records the total number of cockroach complaints in that building.
  • The manager would like to determine the optimal number of bait stations (we’ll call them \(\textrm{traps}\)) that balances the losses (\(L\)) that complaints (\(\textrm{complaints}\)) generate with the all-in cost of maintaining the traps (\(\textrm{C}\)).
  • Here \(L\) includes things like the cost of calling an exterminator, potential losses due to bad reviews of the buildings, etc., while \(C\) refers to the costs of maintaining the traps themselves.

Fortunately, Bayesian data analysis provides a coherent framework for us to tackle this problem.

Formally, we are interested in finding

\[ \arg\min_{\textrm{traps} \in \mathbb{N}} \mathbb{E}_{\text{complaints}}[L(\textrm{complaints}(\textrm{traps})) + \textrm{C}(\textrm{traps})] \]

Note that this also equals:

\[ \arg\max_{\textrm{traps} \in \mathbb{N}} \mathbb{E}_{\text{complaints}}[-L(\textrm{complaints}(\textrm{traps})) - \textrm{C}(\textrm{traps})] \]

We’ll use this formulation of the problem because it accords with other texts on decision theory (see, for example, Applied Statistical Decision Theory by Raiffa and Schlaifer).

The property manager would also, if possible, like to learn how these results generalize to buildings they haven’t treated so they can understand the potential costs of pest control at buildings they are acquiring as well as for the rest of their building portfolio.

As the property manager has complete control over the number of traps set, the random variable contributing to this expectation is the number of complaints given the number of traps. We will model the number of complaints as a function of the number of traps.
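To make the decision rule concrete, here is a minimal sketch in R of how we will eventually use it: given posterior predictive draws of complaints at each candidate number of traps, and loss and cost functions, pick the number of traps that maximizes expected utility. Everything numeric below is a placeholder (the complaint draws are simulated from made-up parameters, and the per-complaint loss and per-trap cost are hypothetical), not an estimate.

```r
set.seed(1)
traps_grid <- 0:20

# stand-in for posterior predictive draws of complaints:
# column j holds 4000 draws of complaints given traps_grid[j] traps
pred_complaints <- sapply(traps_grid, function(k) rpois(4000, exp(2 - 0.2 * k)))

loss_per_complaint <- 100  # hypothetical loss L per complaint
cost_per_trap <- 20        # hypothetical all-in cost C per trap per period

# expected utility = E[-L(complaints)] - C(traps), estimated by the draw means
expected_utility <- sapply(seq_along(traps_grid), function(j) {
  mean(-loss_per_complaint * pred_complaints[, j]) - cost_per_trap * traps_grid[j]
})
best_traps <- traps_grid[which.max(expected_utility)]
```

With real posterior predictive draws in place of the `rpois()` stand-in, `best_traps` is the quantity the manager is asking for.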

The data

The data provided to us is in a file called pest_data.RDS. Let’s load the data and see what the structure is:

pest_data <- readRDS('data/pest_data.RDS')
str(pest_data)
'data.frame':   120 obs. of  13 variables:
 $ building_id         : int  37 37 37 37 37 37 37 37 37 37 ...
 $ date                : Date, format: "2017-01-15" "2017-02-14" ...
 $ traps               : num  8 8 9 10 11 11 10 10 9 9 ...
 $ floors              : num  8 8 8 8 8 8 8 8 8 8 ...
 $ sq_footage_p_floor  : num  5149 5149 5149 5149 5149 ...
 $ live_in_super       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ monthly_average_rent: num  3847 3847 3847 3847 3847 ...
 $ average_tenant_age  : num  53.9 53.9 53.9 53.9 53.9 ...
 $ age_of_building     : num  47 47 47 47 47 47 47 47 47 47 ...
 $ total_sq_foot       : num  41192 41192 41192 41192 41192 ...
 $ month               : num  1 2 3 4 5 6 7 8 9 10 ...
 $ complaints          : num  1 3 0 1 0 0 4 3 2 2 ...
 $ log_sq_foot_1e4     : num  1.42 1.42 1.42 1.42 1.42 ...

We have access to the following fields: a building identifier and observation date, the number of bait stations (traps) placed, the monthly complaint count, and several building-level characteristics (floors, square footage per floor and in total, live-in super, average rent, average tenant age, and building age).

First, let’s see how many buildings we have in the data:

length(unique(pest_data$building_id))
[1] 10

And make some plots of the raw data:

ggplot(pest_data, aes(x = complaints)) +
  geom_bar()

ggplot(pest_data, aes(x = traps, y = complaints)) +
  geom_jitter()

ggplot(pest_data, aes(x = date, y = complaints, color = live_in_super == TRUE)) +
  geom_line(aes(linetype = "Number of complaints")) +
  geom_point(color = "black") +
  geom_line(aes(y = traps, linetype = "Number of traps"), color = "black", linewidth = 0.25) +
  facet_wrap(~building_id, scales = "free", ncol = 2, labeller = label_both) +
  scale_x_date(name = "Month", date_labels = "%b") +
  scale_y_continuous(name = "", limits = range(pest_data$complaints)) +
  scale_linetype_discrete(name = "") +
  scale_color_discrete(name = "Live-in super")

The first question we’ll look at is just whether the number of complaints per building per month is associated with the number of bait stations per building per month, ignoring the temporal and across-building variation (we’ll get to that later). This requires only two variables, \(\textrm{complaints}\) and \(\textrm{traps}\). How should we model the number of complaints?

Modeling count data : Poisson distribution

We already know some rudimentary information about what we should expect. The number of complaints over a month should be either zero or a positive integer. The property manager tells us that it is possible but unlikely that the number of complaints in a given month is zero. Occasionally there are a very large number of complaints in a single month. A common way of modeling this sort of skewed, lower-bounded count data is as a Poisson random variable.

A note of caution about Poisson random variables before we proceed: Poisson random variables exhibit the distinctive characteristic that the variance equals the mean. In real-world count data, the variance often exceeds the mean, or, equivalently, the data are said to be "over-dispersed". For now, however, we'll start with the Poisson model and then check whether over-dispersion is a problem by comparing our model's predictions to the data.
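A quick simulated illustration of over-dispersion (these are simulated numbers, not the pest data): a Poisson sample has variance close to its mean, while a negative binomial sample with the same mean has a larger variance.

```r
set.seed(123)
y_pois <- rpois(1e5, lambda = 5)
y_nb <- rnbinom(1e5, mu = 5, size = 2)  # size is the dispersion parameter

c(mean(y_pois), var(y_pois))  # both close to 5
c(mean(y_nb), var(y_nb))      # mean near 5, variance near 5 + 5^2/2 = 17.5
```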

Model

Given that we have chosen a Poisson regression, we define the likelihood to be the Poisson probability mass function over the number of complaints, conditional on the number of bait stations placed in the building, denoted below as traps. As we said above, this model implies that the conditional mean of the outcome variable complaints (number of complaints) is the same as its conditional variance (\(E[Y|X] = Var[Y|X]\)). We’ll investigate whether this is a good assumption after we fit the model.

For building \(b = 1,\dots,10\) at time (month) \(t = 1,\dots,12\), we have

\[ \begin{align*} \textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\lambda_{b,t}) \\ \lambda_{b,t} & = \exp{(\eta_{b,t})} \\ \eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t} \end{align*} \]

Prior predictive checks

Before we fit the model to real data, we should check that our priors and model can generate plausible simulated data. What do you think might be reasonable priors?

In R we can simulate from the prior predictive distribution using a function like this:

# poisson data generating process (dgp)
# using normal distributions for priors on alpha and beta
simple_poisson_dgp <- function(traps, alpha_mean, alpha_sd, beta_mean, beta_sd) {
  N <- length(traps)
  alpha <- rnorm(1, mean = alpha_mean, sd = alpha_sd)
  beta <- rnorm(1, mean = beta_mean, sd = beta_sd)
  complaints <- rpois(N, lambda = exp(alpha + beta * traps))
  return(complaints)
}

Try with \(N(0,10)\) priors on both parameters:

# you can run this chunk multiple times to keep generating different datasets
simple_poisson_dgp(
  pest_data$traps,
  alpha_mean = 0,
  alpha_sd = 10,
  beta_mean = 0,
  beta_sd = 10
)
  [1] 2.943756e+37 2.943756e+37 3.560414e+41 4.306249e+45 5.208322e+49
  [6] 5.208322e+49 4.306249e+45 4.306249e+45 3.560414e+41 3.560414e+41
 [11] 2.943756e+37 3.560414e+41 2.433903e+33 2.433903e+33 2.943756e+37
 [16] 3.560414e+41 4.306249e+45 3.560414e+41 3.560414e+41 4.306249e+45
 [21] 4.306249e+45 3.560414e+41 3.560414e+41 2.943756e+37 2.943756e+37
 [26] 2.943756e+37 2.433903e+33 2.012355e+29 2.012355e+29 2.433903e+33
 [31] 2.012355e+29 2.433903e+33 2.433903e+33 2.433903e+33 2.433903e+33
 [36] 2.943756e+37 2.012355e+29 2.012355e+29 2.433903e+33 2.433903e+33
 [41] 2.012355e+29 2.433903e+33 2.433903e+33 2.012355e+29 2.433903e+33
 [46] 2.012355e+29 2.433903e+33 2.943756e+37 2.433903e+33 2.433903e+33
 [51] 2.943756e+37 2.433903e+33 2.012355e+29 2.012355e+29 2.433903e+33
 [56] 2.433903e+33 2.012355e+29 2.012355e+29 2.012355e+29 2.433903e+33
 [61] 2.433903e+33 2.433903e+33 2.433903e+33 2.943756e+37 2.433903e+33
 [66] 2.433903e+33 2.433903e+33 2.012355e+29 2.012355e+29 1.663818e+25
 [71] 2.012355e+29 2.433903e+33 1.663818e+25 1.663818e+25 2.012355e+29
 [76] 2.012355e+29 2.433903e+33 2.943756e+37 3.560414e+41 3.560414e+41
 [81] 3.560414e+41 3.560414e+41 3.560414e+41 4.306249e+45 2.433903e+33
 [86] 2.433903e+33 2.433903e+33 2.012355e+29 1.663818e+25 1.375648e+21
 [91] 1.375648e+21 1.663818e+25 1.663818e+25 2.012355e+29 1.663818e+25
 [96] 1.663818e+25 2.943756e+37 2.433903e+33 2.012355e+29 2.433903e+33
[101] 2.943756e+37 2.943756e+37 2.943756e+37 3.560414e+41 2.943756e+37
[106] 2.433903e+33 2.943756e+37 2.943756e+37 2.433903e+33 2.943756e+37
[111] 2.433903e+33 2.012355e+29 2.012355e+29 1.663818e+25 1.375648e+21
[116] 1.137388e+17 9.403941e+12 9.403946e+12 9.403940e+12 7.775135e+08

Try with \(N(0,1)\) priors on both parameters:

simple_poisson_dgp(
  pest_data$traps,
  alpha_mean = 0,
  alpha_sd = 1,
  beta_mean = 0,
  beta_sd = 1
)
  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[112] 0 0 0 0 0 0 0 0 0

A more reasonable prior:

simple_poisson_dgp(
  traps = pest_data$traps,
  alpha_mean = 2,     # approx log(7), i.e., if no traps expect about 7 complaints
  alpha_sd = 1,
  beta_mean = -0.25,  # assume more traps -> fewer complaints
  beta_sd = 0.5       # but big enough sd to allow us to be wrong
)
  [1] 25 26 35 47 52 51 39 45 32 39 35 32 22 18 29 39 37 37 35 32 43 29 41 23 27
 [26] 22 18 25 12 26 14 28 21 17 29 20 14 22 19 13 12 17 21 15 22 14 23 35 22 16
 [51] 24 20 16 18 26 22 17 17 25 29 22 13 26 29 22 17 15 16 17 11 21 23 16 17 17
 [76] 15 30 20 37 37 25 26 30 32 28 15 20 20 19 16 11 18 17 14 15 10 29 33 26 17
[101] 24 29 22 35 27 25 21 20 24 23 31 19 17 18 13  6  7 11 12  5

Simulate 1000 times and calculate summary stats:

prior_preds <- t(replicate(1000, expr = {
  simple_poisson_dgp(
    traps = pest_data$traps,
    alpha_mean = 2,
    alpha_sd = 1,
    beta_mean = -0.25,
    beta_sd = 0.5
  )
}))
summary(apply(prior_preds, 1, mean))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
     0.0      0.2      1.9   1160.5     15.5 543048.4 
summary(apply(prior_preds, 1, min))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.000   5.244   3.000 200.000 
summary(apply(prior_preds, 1, max))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
       0        4       10    18208       35 10487100 

Plot the distribution of means:

ppd_stat(prior_preds, stat = "mean")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ppd_stat(prior_preds, stat = "mean") + 
  xlim(-1, 50) + 
  ggtitle("Prior predictive distribution of avg. # of complaints")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Plot the distribution of 90th percentiles:

ppd_stat(prior_preds, stat = function(x) quantile(x, probs = 0.9)) + 
  xlim(-1, 50) + 
  ggtitle("Prior predictive distribution of 90th percentile of complaints")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This is more reasonable than the previous priors, although there is still room for improvement. For example, when checking the mean of the prior predictions, some outlandishly high maximums are too influential (the quartiles are more reasonable). For now we’ll keep these priors, but we’ll make them more realistic as we develop a better model later.

Writing and fitting our first Stan program

Let’s encode the model in a Stan program.

  • Write simple_poisson_regression.stan together, put in stan_programs directory.
comp_simple_pois <- cmdstan_model("stan_programs/simple_poisson_regression.stan")
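For reference, one possible encoding of the model is sketched below; the version you write together in class may differ in details, and the priors shown are just the ones from our prior predictive checks.

```stan
data {
  int<lower=1> N;
  array[N] int<lower=0> complaints;
  vector<lower=0>[N] traps;
}
parameters {
  real alpha;
  real beta;
}
model {
  alpha ~ normal(2, 1);        // priors from the prior predictive checks
  beta ~ normal(-0.25, 0.5);
  complaints ~ poisson_log(alpha + beta * traps);
}
generated quantities {
  array[N] int y_rep;          // replicated data for predictive checks
  for (n in 1:N) {
    y_rep[n] = poisson_log_rng(alpha + beta * traps[n]);
  }
}
```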

To fit the model to the data we’ll first create a list to pass to Stan using the variables in the pest_data data frame. The names of the list elements must match the names used in the data block of the Stan program.

standata_simple <- list(
  N = nrow(pest_data),
  complaints = pest_data$complaints,
  traps = pest_data$traps
)
str(standata_simple)
List of 3
 $ N         : int 120
 $ complaints: num [1:120] 1 3 0 1 0 0 4 3 2 2 ...
 $ traps     : num [1:120] 8 8 9 10 11 11 10 10 9 9 ...

We have already compiled the model, so we can jump straight to sampling from it.

fit_simple_pois <- comp_simple_pois$sample(
  data = standata_simple,
  # these are the defaults but specifying them anyway
  # so you can see how to use them:
  chains = 4,
  iter_warmup = 1000,
  iter_sampling = 1000
)
Running MCMC with 4 sequential chains...

Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 Iteration:  100 / 2000 [  5%]  (Warmup) 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 Iteration:  300 / 2000 [ 15%]  (Warmup) 
Chain 1 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 1 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 1 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 1 Iteration:  700 / 2000 [ 35%]  (Warmup) 
Chain 1 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 1 Iteration:  900 / 2000 [ 45%]  (Warmup) 
Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
Chain 1 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 1 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
Chain 1 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 1 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 1 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 1 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
Chain 1 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 1 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 1 finished in 0.1 seconds.
Chain 2 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 2 Iteration:  100 / 2000 [  5%]  (Warmup) 
Chain 2 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 2 Iteration:  300 / 2000 [ 15%]  (Warmup) 
Chain 2 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 2 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 2 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 2 Iteration:  700 / 2000 [ 35%]  (Warmup) 
Chain 2 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 2 Iteration:  900 / 2000 [ 45%]  (Warmup) 
Chain 2 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 2 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 2 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
Chain 2 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 2 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
Chain 2 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 2 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 2 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 2 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
Chain 2 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 2 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
Chain 2 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 2 finished in 0.1 seconds.
Chain 3 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 3 Iteration:  100 / 2000 [  5%]  (Warmup) 
Chain 3 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 3 Iteration:  300 / 2000 [ 15%]  (Warmup) 
Chain 3 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 3 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 3 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 3 Iteration:  700 / 2000 [ 35%]  (Warmup) 
Chain 3 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 3 Iteration:  900 / 2000 [ 45%]  (Warmup) 
Chain 3 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 3 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 3 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
Chain 3 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 3 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
Chain 3 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 3 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 3 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 3 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
Chain 3 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 3 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
Chain 3 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 3 finished in 0.1 seconds.
Chain 4 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 4 Iteration:  100 / 2000 [  5%]  (Warmup) 
Chain 4 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 4 Iteration:  300 / 2000 [ 15%]  (Warmup) 
Chain 4 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 4 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 4 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 4 Iteration:  700 / 2000 [ 35%]  (Warmup) 
Chain 4 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 4 Iteration:  900 / 2000 [ 45%]  (Warmup) 
Chain 4 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 4 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 4 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
Chain 4 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 4 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
Chain 4 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 4 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 4 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 4 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
Chain 4 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 4 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
Chain 4 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 4 finished in 0.1 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.1 seconds.
Total execution time: 1.1 seconds.
# The summary method calls posterior::summarize_draws()
fit_simple_pois$summary(variables = c("alpha", "beta"))
# A tibble: 2 × 10
  variable   mean median     sd    mad     q5    q95  rhat ess_bulk ess_tail
  <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl>    <dbl>    <dbl>
1 alpha     2.60   2.60  0.156  0.154   2.34   2.85   1.00     751.     707.
2 beta     -0.195 -0.195 0.0235 0.0236 -0.233 -0.156  1.00     761.     635.

We can also plot the posterior distributions:

# https://mc-stan.org/bayesplot/reference/MCMC-distributions
draws <- fit_simple_pois$draws(c("alpha", "beta"), format = "matrix")
mcmc_hist(draws) # marginal posteriors of alpha and beta
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mcmc_scatter(draws, alpha = 0.2, size = 1) # joint posterior of (alpha, beta)

And compare them to draws from the prior:

alpha_prior_post <- cbind(
  alpha_prior = rnorm(4000, 2, 1),
  alpha_posterior = as.vector(draws[, "alpha"])
)
mcmc_hist(
  alpha_prior_post,
  facet_args = list(nrow = 2),
  binwidth = 0.1
) + xlim(range(alpha_prior_post))

beta_prior_post <- cbind(
  beta_prior = rnorm(4000, -0.25, 0.5),
  beta_posterior = as.vector(draws[, "beta"])
)
mcmc_hist(
  beta_prior_post,
  facet_args = list(nrow = 2),
  binwidth = 0.05
) + xlim(range(beta_prior_post))

From the posterior of beta, it appears that increasing the number of bait stations set in a building is associated with fewer cockroach complaints in the following month. However, we still need to consider how well the model fits.

To try at home: change the priors and see how it affects the posterior.

Posterior predictive checking

y_rep <- fit_simple_pois$draws("y_rep", format = "matrix")
dim(y_rep) # number of posterior draws x number of data points
[1] 4000  120
# https://mc-stan.org/bayesplot/reference/PPC-distributions#plot-descriptions
ppc_dens_overlay(y = standata_simple$complaints, yrep = y_rep[1:200,])

# can also compare cumulative distribution functions
ppc_ecdf_overlay(y = standata_simple$complaints, yrep = y_rep[1:200,])

In the plot above we have the kernel density estimate of the observed data (\(y\), thicker curve) and 200 simulated data sets (\(y_{rep}\), thin curves) from the posterior predictive distribution. Here the posterior predictive simulations are not as dispersed as the real data and don’t seem to capture the rate of zeros well at all. This suggests the Poisson model may not be sufficiently flexible for this data.

Let’s explore this further by looking directly at the proportion of zeros in the real data and predicted data.

# calculate the proportion of zeros in a vector
prop_zero <- function(x) mean(x == 0)

# https://mc-stan.org/bayesplot/reference/PPC-test-statistics#plot-descriptions
ppc_stat(
  y = standata_simple$complaints,
  yrep = y_rep,
  stat = "prop_zero",
  binwidth = .01
)

The plot above shows the observed proportion of zeros (thick vertical line) and a histogram of the proportion of zeros in each of the simulated data sets. It is clear that the model does not capture this feature of the data well at all.

The rootogram is another useful plot for comparing the observed and expected number of complaints. It plots the square roots of the expected counts (continuous line) and the observed counts (blue histogram):

# https://mc-stan.org/bayesplot/reference/PPC-discrete#plot-descriptions
ppc_rootogram(standata_simple$complaints, yrep = y_rep)

If the model fit well, these would be relatively similar. However, in this figure we can see that the number of complaints is underestimated when there are few complaints, overestimated for medium numbers of complaints, and underestimated again when there are a large number of complaints.

The ppc_bars() function will make a bar plot of the observed values and overlay prediction intervals but not on the square root scale (unlike the rootogram):

# https://mc-stan.org/bayesplot/reference/PPC-discrete#plot-descriptions
ppc_bars(standata_simple$complaints, yrep = y_rep) + labs(x = "Complaints")

We can also view how the predicted number of complaints varies with the number of traps:

# jitter number of traps (x) since we have multiple observations at some values
ppc_intervals(
  y = standata_simple$complaints, 
  yrep = y_rep,
  x = standata_simple$traps + rnorm(standata_simple$N, 0, 0.1)
) +  
  labs(
    x = "Number of traps (jittered)", 
    y = "Predicted number of complaints"
  )

Check the predicted mean number of complaints by building:

ppc_stat_grouped(
  y = standata_simple$complaints,
  yrep = y_rep,
  group = pest_data$building_id,
  stat = "mean",
  binwidth = 0.2
)

We haven’t included building-specific information yet, so we shouldn’t expect this to be great.

Expanding the model: multiple predictors

Currently, our model’s mean parameter is a rate of complaints per 30 days, but we’re modeling a process that occurs over an area as well as over time. We have the square footage of each building, so if we add that information into the model, we can interpret our parameters as a rate of complaints per square foot per 30 days.

\[ \begin{align*} \textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\textrm{sq_foot}_b\,\lambda_{b,t}) \\ \lambda_{b,t} & = \exp{(\eta_{b,t} )} \\ \eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t} \end{align*} \]

The term \(\text{sq_foot}\) is called an exposure term. If we take its log, we can move it into \(\eta_{b,t}\):

\[ \begin{align*} \textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\lambda_{b,t}) \\ \lambda_{b,t} & = \exp{(\eta_{b,t} )} \\ \eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t} + \textrm{log_sq_foot}_b \end{align*} \]
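As a quick sanity check on the algebra, multiplying the rate by the exposure is the same as adding its log inside the exponential (the numbers here are arbitrary):

```r
# arbitrary illustrative values
sq_foot <- 4.1
eta <- 0.5

# Poisson(sq_foot * exp(eta)) has the same rate as Poisson(exp(eta + log(sq_foot)))
all.equal(sq_foot * exp(eta), exp(eta + log(sq_foot)))  # TRUE
```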

We will also include whether there is a live-in super or not as a predictor for the number of complaints, which gives us:

\[ \eta_{b,t} = \alpha + \beta \, \textrm{traps}_{b,t} + \beta_{\rm super} \textrm{live_in_super}_{b,t} + \textrm{log_sq_foot}_b \]

Add these new variables to the data list for Stan.

standata_simple$log_sq_foot <- pest_data$log_sq_foot_1e4 # log(total_sq_foot/1e4)
standata_simple$live_in_super <- pest_data$live_in_super

Stan program for Poisson multiple regression

Now we need a new Stan program that uses multiple predictors.

  • Write multiple_poisson_regression.stan
comp_multi_pois <- cmdstan_model("stan_programs/multiple_poisson_regression.stan")
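Again for reference, a sketch of what this program might look like; the priors on the intercept and on beta_super are illustrative choices, not ones prescribed by the text.

```stan
data {
  int<lower=1> N;
  array[N] int<lower=0> complaints;
  vector<lower=0>[N] traps;
  vector<lower=0, upper=1>[N] live_in_super;
  vector[N] log_sq_foot;       // exposure term, entered as an offset
}
parameters {
  real alpha;
  real beta;
  real beta_super;
}
model {
  alpha ~ normal(log(4), 1);   // illustrative priors
  beta ~ normal(-0.25, 0.5);
  beta_super ~ normal(-0.5, 1);
  complaints ~ poisson_log(alpha + beta * traps
                           + beta_super * live_in_super + log_sq_foot);
}
generated quantities {
  array[N] int y_rep;
  for (n in 1:N) {
    y_rep[n] = poisson_log_rng(alpha + beta * traps[n]
                               + beta_super * live_in_super[n] + log_sq_foot[n]);
  }
}
```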

Fit the Poisson multiple regression

fit_multi_pois <- comp_multi_pois$sample(
  data = standata_simple,
  refresh = 0 # turn off printed progress updates
) 
Running MCMC with 4 sequential chains...

Chain 1 finished in 0.3 seconds.
Chain 2 finished in 0.3 seconds.
Chain 3 finished in 0.4 seconds.
Chain 4 finished in 0.3 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.3 seconds.
Total execution time: 1.9 seconds.
fit_multi_pois$summary(variables = c("alpha", "beta", "beta_super"))
# A tibble: 3 × 10
  variable     mean median     sd    mad     q5     q95  rhat ess_bulk ess_tail
  <chr>       <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl> <dbl>    <dbl>    <dbl>
1 alpha       1.22   1.22  0.209  0.206   0.878  1.57    1.00     891.    1233.
2 beta       -0.215 -0.215 0.0283 0.0279 -0.264 -0.169   1.00     944.    1201.
3 beta_super -0.292 -0.291 0.125  0.123  -0.502 -0.0930  1.01    1045.    1468.
y_rep <- fit_multi_pois$draws("y_rep", format = "matrix")
ppc_dens_overlay(standata_simple$complaints, y_rep[1:200,])

ppc_ecdf_overlay(standata_simple$complaints, y_rep[1:200,])

Once again, it looks like we haven’t captured the smaller counts very well, nor the larger counts.

We’re still severely underestimating the proportion of zeros in the data:

prop_zero <- function(x) mean(x == 0)
ppc_stat(
  y = standata_simple$complaints,
  yrep = y_rep,
  stat = "prop_zero",
  binwidth = 0.01
)

Ideally this vertical line would fall somewhere within the histogram.

We can also plot uncertainty intervals for the predicted complaints for different numbers of traps.

ppc_intervals(
  y = standata_simple$complaints,
  yrep = y_rep,
  x = standata_simple$traps + rnorm(standata_simple$N, 0, 0.1)
) +
  labs(
    x = "Number of traps (jittered)", 
    y = "Predicted number of complaints"
  )

We can see that we’ve increased the tails a bit more at the larger numbers of traps but we still have some large observed numbers of complaints that the model would consider extremely unlikely events.

Modeling count data with the Negative Binomial

When we considered modeling the data using a Poisson, we saw that the model didn’t appear to fit the data as well as we would like. In particular the model underpredicted low and high numbers of complaints, and overpredicted the medium number of complaints. This is one indication of over-dispersion, where the variance is larger than the mean. (Maybe people either complain a lot or don’t complain much? Maybe other reasons?) A Poisson model doesn’t fit over-dispersed count data very well because the same parameter \(\lambda\) controls both the expected counts and the variance of those counts. The natural alternative to this is the negative binomial model:

\[ \begin{align*} \text{complaints}_{b,t} & \sim \text{Neg-Binomial}(\lambda_{b,t}, \phi) \\ \lambda_{b,t} & = \exp{(\eta_{b,t})} \\ \eta_{b,t} &= \alpha + \beta \, {\rm traps}_{b,t} + \beta_{\rm super} \, {\rm super}_{b} + \text{log_sq_foot}_{b} \end{align*} \]

In Stan the negative binomial mass function we’ll use is called \(\texttt{neg_binomial_2_log}(\text{ints} \, y, \text{reals} \, \eta, \text{reals} \, \phi)\). Like the poisson_log function, this negative binomial mass function is parameterized in terms of its log-mean, \(\eta\), but it also has a precision parameter \(\phi\) such that

\[ \mathbb{E}[y] \, = \lambda = \exp(\eta) \]

\[ \text{Var}[y] = \lambda + \lambda^2/\phi = \exp(\eta) + \exp(\eta)^2 / \phi. \]

As \(\phi\) gets larger, the term \(\lambda^2 / \phi\) approaches zero, so the variance of the negative binomial approaches \(\lambda\); i.e., the negative binomial gets closer and closer to the Poisson.
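A quick numerical check of that limit (the values of \(\lambda\) and \(\phi\) are chosen arbitrarily):

```r
lambda <- 10
phi <- c(1, 10, 100, 1e4)
lambda + lambda^2 / phi  # 110, 20, 11, 10.01 -- shrinking toward lambda = 10
```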

Stan program for negative-binomial regression

  • Write multiple_NB_regression.stan together
comp_multi_NB <- cmdstan_model("stan_programs/multiple_NB_regression.stan")
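One common way to write this program is sketched below (again, the priors are illustrative). Note the parameterization in terms of `inv_phi` \(= 1/\phi\): a half-normal prior on `inv_phi` is easier to reason about than a prior on \(\phi\) itself, and it also explains the informational messages you will see during warmup below, when proposals push `inv_phi` toward 0 or infinity and \(\phi\) momentarily becomes infinite or zero.

```stan
data {
  int<lower=1> N;
  array[N] int<lower=0> complaints;
  vector<lower=0>[N] traps;
  vector<lower=0, upper=1>[N] live_in_super;
  vector[N] log_sq_foot;
}
parameters {
  real alpha;
  real beta;
  real beta_super;
  real<lower=0> inv_phi;   // 1 / phi
}
transformed parameters {
  real phi = inv(inv_phi);
}
model {
  alpha ~ normal(log(4), 1);   // illustrative priors
  beta ~ normal(-0.25, 0.5);
  beta_super ~ normal(-0.5, 1);
  inv_phi ~ normal(0, 1);      // half-normal, given the lower bound
  complaints ~ neg_binomial_2_log(alpha + beta * traps
                                  + beta_super * live_in_super
                                  + log_sq_foot, phi);
}
generated quantities {
  array[N] int y_rep;
  for (n in 1:N) {
    y_rep[n] = neg_binomial_2_log_rng(alpha + beta * traps[n]
                                      + beta_super * live_in_super[n]
                                      + log_sq_foot[n], phi);
  }
}
```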

Fit to data and check the fit

fit_multi_NB <- comp_multi_NB$sample(data = standata_simple)
Running MCMC with 4 sequential chains...

Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 1 Exception: neg_binomial_2_log_lpmf: Precision parameter is inf, but must be positive finite! (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpTngcHs/model-93a160694e4.stan', line 24, column 2 to column 44)
Chain 1 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 1 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 1 
Chain 1 Iteration:  100 / 2000 [  5%]  (Warmup) 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 Iteration:  300 / 2000 [ 15%]  (Warmup) 
Chain 1 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 1 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 1 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 1 Iteration:  700 / 2000 [ 35%]  (Warmup) 
Chain 1 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 1 Iteration:  900 / 2000 [ 45%]  (Warmup) 
Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
Chain 1 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 1 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
Chain 1 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 1 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 1 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 1 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
Chain 1 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 1 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 1 finished in 0.9 seconds.
Chain 2 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 2 Iteration:  100 / 2000 [  5%]  (Warmup) 
Chain 2 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 2 Exception: neg_binomial_2_log_lpmf: Precision parameter is 0, but must be positive finite! (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpTngcHs/model-93a160694e4.stan', line 24, column 2 to column 44)
Chain 2 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 2 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 2 
Chain 2 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 2 Iteration:  300 / 2000 [ 15%]  (Warmup) 
Chain 2 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 2 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 2 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 2 Iteration:  700 / 2000 [ 35%]  (Warmup) 
Chain 2 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 2 Iteration:  900 / 2000 [ 45%]  (Warmup) 
Chain 2 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 2 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 2 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
Chain 2 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 2 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
Chain 2 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 2 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 2 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 2 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
Chain 2 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 2 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
Chain 2 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 2 finished in 0.9 seconds.
Chain 3 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 3 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 3 Exception: neg_binomial_2_log_lpmf: Precision parameter is 0, but must be positive finite! (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpTngcHs/model-93a160694e4.stan', line 24, column 2 to column 44)
Chain 3 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 3 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 3 
Chain 3 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 3 Exception: neg_binomial_2_log_lpmf: Precision parameter is 0, but must be positive finite! (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpTngcHs/model-93a160694e4.stan', line 24, column 2 to column 44)
Chain 3 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 3 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 3 
Chain 3 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 3 Exception: neg_binomial_2_log_lpmf: Precision parameter is 0, but must be positive finite! (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpTngcHs/model-93a160694e4.stan', line 24, column 2 to column 44)
Chain 3 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 3 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 3 
Chain 3 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 3 Exception: neg_binomial_2_log_lpmf: Precision parameter is 0, but must be positive finite! (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpTngcHs/model-93a160694e4.stan', line 24, column 2 to column 44)
Chain 3 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 3 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 3 
Chain 3 Iteration:  100 / 2000 [  5%]  (Warmup) 
Chain 3 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 3 Iteration:  300 / 2000 [ 15%]  (Warmup) 
Chain 3 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 3 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 3 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 3 Iteration:  700 / 2000 [ 35%]  (Warmup) 
Chain 3 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 3 Iteration:  900 / 2000 [ 45%]  (Warmup) 
Chain 3 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 3 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 3 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
Chain 3 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 3 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
Chain 3 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 3 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 3 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 3 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
Chain 3 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 3 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
Chain 3 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 3 finished in 0.9 seconds.
Chain 4 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 4 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 4 Exception: neg_binomial_2_log_lpmf: Precision parameter is inf, but must be positive finite! (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpTngcHs/model-93a160694e4.stan', line 24, column 2 to column 44)
Chain 4 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 4 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 4 
Chain 4 Iteration:  100 / 2000 [  5%]  (Warmup) 
Chain 4 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 4 Iteration:  300 / 2000 [ 15%]  (Warmup) 
Chain 4 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 4 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 4 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 4 Iteration:  700 / 2000 [ 35%]  (Warmup) 
Chain 4 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 4 Iteration:  900 / 2000 [ 45%]  (Warmup) 
Chain 4 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 4 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 4 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
Chain 4 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 4 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
Chain 4 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 4 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 4 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 4 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
Chain 4 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 4 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
Chain 4 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 4 finished in 1.0 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.9 seconds.
Total execution time: 4.2 seconds.
mcmc_hist(fit_multi_NB$draws(c("inv_phi", "phi")))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let’s look at our predictions vs. the data.

y_rep <- fit_multi_NB$draws("y_rep", format = "matrix")
ppc_dens_overlay(y = standata_simple$complaints, yrep = y_rep[1:200,])

It appears that our model now captures both the small counts and the tails of the distribution better.

Let’s check the proportion of zeros:

prop_zero <- function(x) mean(x == 0) # proportion of zeros; ppc_stat looks this up by name
ppc_stat(
  y = standata_simple$complaints,
  yrep = y_rep,
  stat = "prop_zero",
  binwidth = 0.01
)

Check predictions by number of traps:

ppc_intervals(
  y = standata_simple$complaints,
  yrep = y_rep,
  x = standata_simple$traps + rnorm(standata_simple$N, 0, 0.1)
) +
  labs(
    x = "Number of traps (jittered)", 
    y = "Predicted number of complaints"
  )

We haven’t used the fact that the data are clustered by building yet. A posterior predictive check might elucidate whether it would be a good idea to add the building information into the model.

ppc_stat_grouped(
  y = standata_simple$complaints,
  yrep = fit_simple_pois$draws("y_rep", format = "matrix"),
  group = pest_data$building_id,
  stat = "mean",
  binwidth = 0.2
)

We’re getting plausible predictions for most building means but some are estimated better than others and some have large uncertainties. If we explicitly model the variation across buildings we may be able to get even better estimates.

Hierarchical modeling

Modeling varying intercepts for each building

Let’s add a hierarchical intercept parameter, \(\mu_b\), at the building level to our model.

\[ \text{complaints}_{b,t} \sim \text{Neg-Binomial}(\lambda_{b,t}, \phi) \\ \lambda_{b,t} = \exp{(\eta_{b,t})} \\ \eta_{b,t} = \mu_b + \beta \, {\rm traps}_{b,t} + \beta_{\rm super}\, {\rm super}_b + \text{log_sq_foot}_b \\ \mu_b \sim \text{Normal}(\alpha, \sigma_{\mu}) \]

In our Stan model, \(\mu_b\) is the \(b\)-th element of the vector \(\texttt{mu}\) which has one element per building.

One of our predictors varies only by building, so we can rewrite the above model to avoid some redundant calculations like this (doing \(\beta_{\rm super}\, {\rm super}_b\) only once per building):

\[ \eta_{b,t} = \mu_b + \beta \, {\rm traps}_{b,t} + \text{log_sq_foot}_b\\ \mu_b \sim \text{Normal}(\alpha + \beta_{\text{super}} \, \text{super}_b , \sigma_{\mu}) \]

We have more information at the building level as well, such as the average age of the residents, the average age of the building, and the average per-apartment monthly rent, so we can collect that data into a matrix called building_data, which will have one row per building and four columns:

  • live_in_super
  • age_of_building
  • average_tenant_age
  • monthly_average_rent

We’ll write the Stan model like this (where mu is vectorized):

\[ \eta_{b,t} = \mu_b + \beta \, {\rm traps} + \text{log_sq_foot}\\ \mu \sim \text{Normal}(\alpha + \texttt{building_data} \, \zeta, \,\sigma_{\mu}) \]
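To build intuition for this generative process, here is a minimal R sketch that simulates complaint counts from the hierarchical model. All parameter values (\(\alpha\), \(\beta\), \(\sigma_\mu\), \(\phi\), \(\zeta\)) are made up for illustration, not estimates from the data:

```r
# Simulate from the hierarchical neg-binomial model (hypothetical values only)
set.seed(42)
J <- 10; n_months <- 12; K <- 4
alpha <- 1.5; beta <- -0.25; sigma_mu <- 0.3; phi <- 1.6  # illustrative values
building_data <- matrix(rnorm(J * K), J, K)  # stand-in for centered predictors
zeta <- rnorm(K, 0, 0.1)
log_sq_foot <- rnorm(J, 0, 0.2)

# varying intercepts: mu_b ~ Normal(alpha + building_data %*% zeta, sigma_mu)
mu <- rnorm(J, alpha + building_data %*% zeta, sigma_mu)

traps <- matrix(sample(0:10, J * n_months, replace = TRUE), J, n_months)
# mu and log_sq_foot (length J) recycle over rows, i.e. over buildings
eta <- mu + beta * traps + log_sq_foot
# rnbinom's (mu, size) parameterization matches neg_binomial_2(lambda, phi)
complaints <- matrix(rnbinom(J * n_months, mu = exp(eta), size = phi), J, n_months)
```

Simulating from the model like this is also the basis of prior predictive checks: if the simulated counts look wildly unrealistic, the priors probably need revisiting.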

Prepare building data for hierarchical model

We’ll need to do some more data prep before we can fit our models. Among other things, in order to use the building variable in Stan we will need to transform it from a factor variable to an integer variable.

N_months <- length(unique(pest_data$date))
N_buildings <- length(unique(pest_data$building_id))

# Add some IDs for building and month
pest_data <- pest_data %>%
  mutate(
    building_fac = factor(building_id, levels = unique(building_id)),
    building_idx = as.integer(building_fac),
    ids = rep(1:N_months, N_buildings),
    mo_idx = lubridate::month(date) # we'll use this later
  )

# Center and rescale the building specific data
building_data <- pest_data %>%
    select(
      building_idx,
      live_in_super,
      age_of_building,
      average_tenant_age,
      monthly_average_rent
    ) %>%
    unique() %>%
    arrange(building_idx) %>%
    select(-building_idx) %>%
    scale(scale=FALSE) %>% # this centers (subtracts mean) but doesn't divide by sd
    as.data.frame() %>%
    mutate( # scale by constants to put variables on similar scales
      age_of_building = age_of_building / 10,
      average_tenant_age = average_tenant_age / 10,
      monthly_average_rent = monthly_average_rent / 1000
    ) %>%
    as.matrix()

# Make data list for Stan
standata_hier <-
  with(pest_data,
        list(complaints = complaints,
             N = length(complaints),
             traps = traps,
             log_sq_foot = log(total_sq_foot / 1e4), # with() puts total_sq_foot in scope
             M = N_months,
             mo_idx = as.integer(as.factor(date)),
             J = N_buildings,
             K = ncol(building_data), # number of building-specific variables
             building_idx = building_idx,
             building_data = building_data
             )
        )

Compile and fit the hierarchical model

Let’s compile the model.

comp_hier_NB <- cmdstan_model("stan_programs/hier_NB_regression.stan")

Fit the model to data.

fit_hier_NB <-
  comp_hier_NB$sample(
    data = standata_hier,
    refresh = 500,
    # can run markov chains in parallel. you can check how many cores you
    # have available using parallel::detectCores()
    chains = 4,
    parallel_chains = 4
  )
Running MCMC with 4 parallel chains...

Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 2 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 2 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 2 Exception: neg_binomial_2_log_lpmf: Precision parameter is 0, but must be positive finite! (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpFNe7A9/model-72cc64ca365c.stan', line 27, column 2 to column 44)
Chain 2 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 2 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 2 
Chain 3 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 3 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 3 Exception: neg_binomial_2_log_lpmf: Precision parameter is 0, but must be positive finite! (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpFNe7A9/model-72cc64ca365c.stan', line 27, column 2 to column 44)
Chain 3 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 3 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 3 
Chain 4 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 4 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 4 Exception: neg_binomial_2_log_lpmf: Precision parameter is 0, but must be positive finite! (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpFNe7A9/model-72cc64ca365c.stan', line 27, column 2 to column 44)
Chain 4 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 4 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 4 
Chain 1 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 3 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 4 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 2 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 3 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 3 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 4 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 4 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 2 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 2 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 3 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 2 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 4 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 1 finished in 1.6 seconds.
Chain 3 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 3 finished in 1.5 seconds.
Chain 2 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 2 finished in 1.7 seconds.
Chain 4 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 4 finished in 1.7 seconds.

All 4 chains finished successfully.
Mean chain execution time: 1.6 seconds.
Total execution time: 2.0 seconds.
Warning: 200 of 4000 (5.0%) transitions ended with a divergence.
See https://mc-stan.org/misc/warnings for details.

We get a bunch of warnings from Stan about divergent transitions, which is an indication that there may be regions of the posterior that have not been explored by the Markov chains.

In order to figure out what this means and what to do about it, we’ll pause here to discuss MCMC in general and Stan’s implementation in particular, with the help of an interactive MCMC demo.

Diagnostics

Divergences are discussed in more detail in the case study Diagnosing Biased Inference with Divergences, the bayesplot MCMC diagnostics vignette and A Conceptual Introduction to Hamiltonian Monte Carlo.

In this example we will see that we have divergent transitions because we need to reparameterize our model - i.e., we will retain the overall structure of the model, but transform some of the parameters so that it is easier for Stan to sample from the parameter space. Before showing exactly how to do the reparameterization, we will first look at the evidence that a reparameterization is what’s needed:

  1. The fitted parameter values, including the effective sample sizes
  2. Plots that reveal particular patterns in the locations of the divergences

We can check the effective sample sizes by printing the parameters that are of most interest.

fit_hier_NB$diagnostic_summary()
Warning: 200 of 4000 (5.0%) transitions ended with a divergence.
See https://mc-stan.org/misc/warnings for details.
$num_divergent
[1] 43 61 62 34

$num_max_treedepth
[1] 0 0 0 0

$ebfmi
[1] 0.4858622 0.5864106 0.4112685 0.5364758
fit_hier_NB$summary(variables = c("sigma_mu", "beta", "alpha", "phi", "mu"))
# A tibble: 14 × 10
   variable   mean median     sd    mad      q5    q95  rhat ess_bulk ess_tail
   <chr>     <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl> <dbl>    <dbl>    <dbl>
 1 sigma_mu  0.269  0.232 0.166  0.141   0.0770  0.588  1.01     271.     135.
 2 beta     -0.244 -0.243 0.0594 0.0587 -0.341  -0.146  1.01     579.     902.
 3 alpha     1.36   1.36  0.427  0.425   0.669   2.07   1.01     571.     869.
 4 phi       1.59   1.55  0.359  0.343   1.09    2.25   1.00     902.     950.
 5 mu[1]     1.39   1.39  0.544  0.519   0.492   2.31   1.01     640.     889.
 6 mu[2]     1.34   1.34  0.525  0.520   0.499   2.22   1.01     651.    1003.
 7 mu[3]     1.52   1.51  0.480  0.482   0.758   2.31   1.00     643.    1118.
 8 mu[4]     1.55   1.55  0.484  0.471   0.767   2.36   1.01     634.     940.
 9 mu[5]     1.16   1.16  0.416  0.413   0.475   1.83   1.01     712.    1279.
10 mu[6]     1.26   1.26  0.478  0.480   0.471   2.03   1.01     633.    1213.
11 mu[7]     1.57   1.56  0.513  0.507   0.720   2.38   1.01     620.    1191.
12 mu[8]     1.34   1.33  0.420  0.414   0.659   2.03   1.00     678.    1199.
13 mu[9]     1.52   1.53  0.556  0.552   0.607   2.43   1.00     615.     925.
14 mu[10]    0.917  0.905 0.369  0.378   0.346   1.52   1.01     799.    1557.

You can see that the effective sample sizes are quite low for many of the parameters relative to the total number of draws (4000). This alone doesn’t prove that we need to reparameterize, but it does tell us to look further at the trace plots and pairs plots. First let’s look at the trace plots to see if the divergent transitions form a pattern. We’ll look at the new sigma_mu parameter, which is the hierarchical standard deviation parameter we added to the model.
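To see why autocorrelation drags the effective sample size below the nominal number of draws, here is a small sketch using `ess_bulk()` from the posterior package on synthetic chains (the AR(1) draws below are simulated, not from our model):

```r
library(posterior)

# Simulate 4 chains of 1000 highly autocorrelated AR(1) draws each;
# the effective sample size is far below the nominal 4000 draws
set.seed(1)
ar1_chain <- function(n, rho) {
  x <- numeric(n)
  for (i in 2:n) x[i] <- rho * x[i - 1] + rnorm(1, sd = sqrt(1 - rho^2))
  x
}
draws <- sapply(1:4, function(i) ar1_chain(1000, rho = 0.9))  # 1000 x 4 matrix
ess_bulk(draws)  # much less than 4000
```

The rough intuition: with strongly correlated draws, consecutive iterations carry largely redundant information, so 4000 correlated draws can be worth only a few hundred independent ones.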

# change bayesplot color scheme (to make chains very different colors)
color_scheme_set("brewer-Spectral")
mcmc_trace(
  # use as.array to keep the markov chains separate for trace plots
  fit_hier_NB$draws("sigma_mu"),
  np = nuts_params(fit_hier_NB),
  window = c(500, 1000) # zoom into a window to see more clearly
)

color_scheme_set("blue")

Looks like the divergences (marked in red below the traceplot) correspond to spots where the sampler gets stuck.

# assign to object so we can compare to another plot later

scatter_with_divs <- mcmc_scatter(
  fit_hier_NB$draws(),
  pars = c("mu[4]", "sigma_mu"),
  # look at sigma on log-scale (what Stan uses under the hood for positive parameters)
  transform = list("sigma_mu" = "log"),
  size = 1,
  alpha = 0.3,
  np = nuts_params(fit_hier_NB),
  np_style = scatter_style_np(div_size = 1.5) # change style of divergences points
)
scatter_with_divs + coord_equal()

What we have here is a cloud-like shape, with most of the divergences clustering towards the bottom. We’ll see a bit later that we actually want this to look more like a funnel than a cloud, but the divergences are indicating that the sampler can’t explore the narrowing neck of the funnel.

One way to see why we should expect some version of a funnel is to look at some simulations from the prior, which we can do without MCMC and thus with no risk of sampling problems:

draw_from_prior <- function() {
  sigma_mu_prior <- abs(rnorm(1, mean = 0, sd = 1))
  mu_prior <- rnorm(1, mean = 0, sd = sigma_mu_prior) 
  c("mu" = mu_prior, "log(sigma_mu)" = log(sigma_mu_prior))
}

# plot a bunch of draws
prior_draws <- t(replicate(1000, draw_from_prior()))
mcmc_scatter(prior_draws)

Of course, if the data is at all informative we shouldn’t expect the posterior to look exactly like the prior. But unless the data is incredibly informative about the parameters (and the posterior concentrates away from the narrow neck of the funnel) the sampler is going to have to confront the funnel geometry. (See the bayesplot Visual MCMC Diagnostics vignette for more on this.)

Reparameterize and recheck diagnostics

Instead, we should use the non-centered parameterization for \(\mu_b\). This involves taking standard normal random variables and shifting and scaling them to have the mean and standard deviation we want using properties of the normal distribution.

\[ \mu_{\rm raw} \sim \text{Normal}(0,1) \\ \mu = \text{mean} + \text{sd} \times \mu_{\rm raw} \\ \mu = (\alpha + \texttt{building_data} \, \zeta) + \sigma_{\mu} \mu_{\rm raw} \]

This implies \(\texttt{mu}\) is distributed \(\text{Normal}(\alpha + \texttt{building_data}\, \zeta, \sigma_\mu)\) as before, but it decouples the dependence of the density of each element of \(\texttt{mu}\) from \(\texttt{sigma_mu}\) (\(\sigma_\mu\)) and makes the geometry easier for the sampler to navigate.
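The shift-and-scale identity underlying the non-centered parameterization is easy to verify numerically (the mean and standard deviation below are arbitrary illustrative values):

```r
# If mu_raw ~ Normal(0, 1), then m + s * mu_raw ~ Normal(m, s):
# verify by checking sample moments on a large draw
set.seed(99)
m <- 1.4; s <- 0.3                 # arbitrary target mean and sd
mu_raw <- rnorm(1e5)
mu <- m + s * mu_raw
c(mean = mean(mu), sd = sd(mu))    # close to (1.4, 0.3)
```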

draw_from_prior_non_centered <- function() {
  sigma_mu_prior <- abs(rnorm(1, mean = 0, sd = 1))
  mu_raw_prior <- rnorm(1, mean = 0, sd = 1)
  c("mu_raw" = mu_raw_prior, "log(sigma_mu)" = log(sigma_mu_prior))
}

# plot a bunch of draws
prior_draws <- t(replicate(1000, draw_from_prior_non_centered()))
mcmc_scatter(prior_draws)

To implement this in the Stan program we need to put \(\texttt{mu_raw}\) in the parameters block (with a \(\text{Normal}(0, 1)\) prior in the model block), and then define \(\texttt{mu}\) in the transformed parameters block.

transformed parameters {
  vector[J] mu = alpha + building_data * zeta + sigma_mu * mu_raw;
}
comp_hier_NB_ncp <- cmdstan_model("stan_programs/hier_NB_regression_ncp.stan")

Fit the model to the data.

fit_hier_NB_ncp <- comp_hier_NB_ncp$sample(
  data = standata_hier,
  refresh = 500,
  chains = 4,
  parallel_chains = 4
)
Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 1 Exception: neg_binomial_2_log_lpmf: Precision parameter is 0, but must be positive finite! (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpTngcHs/model-93a28003f65.stan', line 28, column 2 to column 44)
Chain 1 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 1 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 1 
Chain 2 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 2 Exception: neg_binomial_2_log_lpmf: Precision parameter is inf, but must be positive finite! (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpTngcHs/model-93a28003f65.stan', line 28, column 2 to column 44)
Chain 2 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 2 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 2 
Chain 3 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 3 Exception: neg_binomial_2_log_lpmf: Precision parameter is 0, but must be positive finite! (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpTngcHs/model-93a28003f65.stan', line 28, column 2 to column 44)
Chain 3 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 3 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 3 
Chain 3 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 3 Exception: neg_binomial_2_log_lpmf: Precision parameter is 0, but must be positive finite! (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpTngcHs/model-93a28003f65.stan', line 28, column 2 to column 44)
Chain 3 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 3 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 3 
Chain 3 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 3 Exception: neg_binomial_2_log_lpmf: Precision parameter is 0, but must be positive finite! (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpTngcHs/model-93a28003f65.stan', line 28, column 2 to column 44)
Chain 3 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 3 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 3 
Chain 3 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 3 Exception: neg_binomial_2_log_lpmf: Log location parameter[1] is -inf, but must be finite! (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpTngcHs/model-93a28003f65.stan', line 28, column 2 to column 44)
Chain 3 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 3 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 3 
Chain 3 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 3 Exception: neg_binomial_2_log_lpmf: Precision parameter is 0, but must be positive finite! (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpTngcHs/model-93a28003f65.stan', line 28, column 2 to column 44)
Chain 3 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 3 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 3 
Chain 4 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 4 Exception: neg_binomial_2_log_lpmf: Precision parameter is 0, but must be positive finite! (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpTngcHs/model-93a28003f65.stan', line 28, column 2 to column 44)
Chain 4 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 4 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 4 
Running MCMC with 4 parallel chains...

Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 2 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 3 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 4 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 2 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 3 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 4 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 2 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 2 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 3 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 3 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 4 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 4 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 2 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 3 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 4 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 1 finished in 1.8 seconds.
Chain 2 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 2 finished in 1.9 seconds.
Chain 3 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 3 finished in 2.0 seconds.
Chain 4 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 4 finished in 2.0 seconds.

All 4 chains finished successfully.
Mean chain execution time: 1.9 seconds.
Total execution time: 2.4 seconds.

You should have no divergences (or only a few) this time and much improved effective sample sizes:

fit_hier_NB_ncp$summary(variables = c("sigma_mu", "beta", "alpha", "phi", "mu"))
# A tibble: 14 × 10
   variable   mean median     sd    mad      q5    q95  rhat ess_bulk ess_tail
   <chr>     <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl> <dbl>    <dbl>    <dbl>
 1 sigma_mu  0.234  0.201 0.177  0.157   0.0214  0.567  1.00    1155.    1690.
 2 beta     -0.246 -0.247 0.0620 0.0609 -0.345  -0.144  1.00    2826.    2751.
 3 alpha     1.37   1.37  0.436  0.429   0.661   2.09   1.00    2700.    2960.
 4 phi       1.58   1.53  0.349  0.336   1.08    2.20   1.00    4755.    2552.
 5 mu[1]     1.41   1.42  0.556  0.556   0.492   2.32   1.00    2669.    2750.
 6 mu[2]     1.35   1.34  0.539  0.536   0.476   2.24   1.00    2843.    3109.
 7 mu[3]     1.51   1.52  0.491  0.488   0.711   2.31   1.00    3119.    3250.
 8 mu[4]     1.55   1.54  0.493  0.483   0.752   2.37   1.00    2987.    3101.
 9 mu[5]     1.17   1.18  0.419  0.412   0.480   1.84   1.00    2940.    3187.
10 mu[6]     1.30   1.30  0.495  0.484   0.462   2.09   1.00    2774.    2917.
11 mu[7]     1.58   1.59  0.530  0.520   0.712   2.45   1.00    3074.    2952.
12 mu[8]     1.34   1.34  0.433  0.433   0.651   2.07   1.00    3295.    3433.
13 mu[9]     1.55   1.57  0.577  0.575   0.578   2.45   1.00    2846.    2709.
14 mu[10]    0.941  0.933 0.380  0.378   0.334   1.58   1.00    3080.    3216.

Let’s make the same scatter plot we made for the model with many divergences and compare:

scatter_no_divs <- mcmc_scatter(
  fit_hier_NB_ncp$draws(),
  pars = c("mu[4]", "sigma_mu"),
  transform = list("sigma_mu" = "log"),
  size = 1,
  alpha = 0.3,
  np = nuts_params(fit_hier_NB_ncp)
)

bayesplot_grid(
  scatter_with_divs, scatter_no_divs,
  grid_args = list(ncol = 2),
  ylim = c(-10, 1),
  subtitles = c("Centered parameterization",
                "Non-centered parameterization")
)

Comparing the plots, we see that the divergences were a sign that there was part of the posterior that the chains were not exploring.

The marginal PPC plot, again.

y_rep <- fit_hier_NB_ncp$draws("y_rep", format = "matrix")
ppc_dens_overlay(standata_hier$complaints, y_rep[1:200,])

This looks quite nice.

If we’ve captured the building-level means well, then the distribution of predicted means for each building should match well with the observed mean number of complaints for that building.

ppc_stat_grouped(
  y = standata_hier$complaints,
  yrep = y_rep,
  group = pest_data$building_id,
  stat = "mean",
  binwidth = 0.5
)

We weren’t doing terribly with the building-specific means before, but now they are all estimated much more precisely. For example, let’s compare the standard deviation of the y_rep for building 47 from the non-hierarchical and hierarchical models:

y_rep_47_hier <- y_rep[, pest_data$building_id == 47]
y_rep_47_old <- fit_multi_NB$draws("y_rep", format = "matrix")[, pest_data$building_id == 47]

c("Non-hierarchical" = sd(y_rep_47_old), "Hierarchical" = sd(y_rep_47_hier))
Non-hierarchical     Hierarchical 
        6.407346         5.786515 

The model is also able to do a decent job estimating the probability of zero complaints:

prop_zero <- function(x) mean(x == 0)
ppc_stat(
  y = standata_hier$complaints,
  yrep = y_rep,
  stat = prop_zero,
  binwidth = 0.02
)

# plot separately for each building
ppc_stat_grouped(
  y = standata_hier$complaints,
  yrep = y_rep,
  group = pest_data$building_id,
  stat = prop_zero,
  binwidth = 0.1
)

Check quantiles of predictions:

q90 <- function(x) quantile(x, probs = 0.9)
ppc_stat(
  y = standata_hier$complaints,
  yrep = y_rep,
  stat = "q90",
  binwidth = 1
)

Predictions by number of traps:

ppc_intervals_grouped(
  y = standata_hier$complaints,
  yrep = y_rep,
  x = standata_hier$traps + rnorm(standata_hier$N, 0, 0.1), # jitter
  group = standata_hier$building_idx
  # difference between grouping by pest_data$building_id and standata_hier$building_idx
  # is that the latter is consecutive from 1 to N_buildings to make things easier
  # to code in Stan
) +
  labs(
    x = "Number of traps (jittered)",
    y = "Predicted number of complaints",
    title = "Traps vs predicted complaints by building"
  )

Varying intercepts and varying slopes

We have some new data that extends the number of time points for which we have observations for each building. This will let us expand the model with varying slopes in addition to the varying intercepts, and later also model temporal variation.

standata_hier_long <- readRDS("data/standata_hier_long.RDS")

Perhaps if the levels of complaints differ by building, the coefficient for the effect of traps on complaints does too. We can add varying slopes to our model and observe the fit.

\[ \text{complaints}_{b,t} \sim \text{Neg-Binomial}(\lambda_{b,t}, \phi) \\ \lambda_{b,t} = \exp{(\eta_{b,t})}\\ \eta_{b,t} = \mu_b + \kappa_b \, \texttt{traps}_{b,t} + \text{log_sq_foot}_b \\ \mu_b \sim \text{Normal}(\alpha + \texttt{building_data} \, \zeta, \sigma_{\mu}) \\ \kappa_b \sim \text{Normal}(\beta + \texttt{building_data} \, \gamma, \sigma_{\kappa}) \]

Let’s compile and run the model.

comp_hier_NB_slopes <- cmdstan_model("stan_programs/hier_NB_regression_ncp_slopes.stan")
fit_hier_NB_slopes <-
  comp_hier_NB_slopes$sample(
    data = standata_hier_long,
    chains = 4,
    parallel_chains = 4,
    refresh = 200,
    adapt_delta = 0.95
  )
Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 1 Exception: neg_binomial_2_log_lpmf: Precision parameter is 0, but must be positive finite! (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpFNe7A9/model-72cc6fcb46c6.stan', line 41, column 2 to column 44)
Chain 1 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 1 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 1 
Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 1 Exception: neg_binomial_2_log_lpmf: Precision parameter is 0, but must be positive finite! (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpFNe7A9/model-72cc6fcb46c6.stan', line 41, column 2 to column 44)
Chain 1 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 1 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 1 
Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 1 Exception: neg_binomial_2_log_lpmf: Log location parameter[1] is -inf, but must be finite! (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpFNe7A9/model-72cc6fcb46c6.stan', line 41, column 2 to column 44)
Chain 1 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 1 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 1 
Warning: 4 of 4000 (0.0%) transitions ended with a divergence.
See https://mc-stan.org/misc/warnings for details.
Running MCMC with 4 parallel chains...

Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 2 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 3 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 4 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 3 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 4 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 2 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 4 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 2 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 3 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 1 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 4 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 2 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 3 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 1 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 4 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 3 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 2 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 4 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 4 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 3 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 3 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 2 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 2 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 4 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 3 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 2 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 1 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 4 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 3 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 2 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 1 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 4 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 3 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 2 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 1 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 4 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 3 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 2 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 4 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 4 finished in 16.4 seconds.
Chain 1 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 3 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 3 finished in 16.7 seconds.
Chain 2 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 2 finished in 17.0 seconds.
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 1 finished in 18.1 seconds.

All 4 chains finished successfully.
Mean chain execution time: 17.1 seconds.
Total execution time: 18.3 seconds.

We may or may not get a few divergences here, but as long as there aren’t many of them clustered together it’s usually safe to ignore them (as discussed in the Visualization in Bayesian Workflow paper).

fit_hier_NB_slopes$summary(variables = c("alpha", "beta", "sigma_kappa", "kappa", "sigma_mu", "mu"))
# A tibble: 24 × 10
   variable    mean  median     sd    mad      q5    q95  rhat ess_bulk ess_tail
   <chr>      <dbl>   <dbl>  <dbl>  <dbl>   <dbl>  <dbl> <dbl>    <dbl>    <dbl>
 1 alpha     1.45    1.46   0.302  0.286   0.950   1.94   1.00    3156.    2393.
 2 beta     -0.348  -0.349  0.0611 0.0531 -0.443  -0.257  1.00    2629.    1992.
 3 sigma_k…  0.123   0.101  0.0852 0.0582  0.0379  0.276  1.00     820.    1219.
 4 kappa[1] -0.0249 -0.0343 0.0737 0.0661 -0.126   0.111  1.00    1660.    1962.
 5 kappa[2] -0.419  -0.412  0.0957 0.0958 -0.587  -0.271  1.00    2296.    2778.
 6 kappa[3] -0.588  -0.587  0.104  0.104  -0.760  -0.418  1.00    5530.    3415.
 7 kappa[4] -0.224  -0.223  0.0673 0.0621 -0.333  -0.114  1.00    3937.    2880.
 8 kappa[5] -0.608  -0.608  0.0906 0.0881 -0.757  -0.457  1.00    4323.    2999.
 9 kappa[6] -0.437  -0.431  0.0993 0.0891 -0.614  -0.286  1.00    4021.    2736.
10 kappa[7] -0.308  -0.309  0.0652 0.0657 -0.416  -0.202  1.00    4857.    3301.
# ℹ 14 more rows

To see if the model infers building-to-building differences in the coefficient on traps, we can plot a histogram of our marginal posterior distribution for sigma_kappa.

mcmc_hist(fit_hier_NB_slopes$draws("sigma_kappa"), binwidth = 0.01)

While the posterior can’t rule out values of sigma_kappa at or near zero, most of its mass is at small non-zero values, so we will leave in the hierarchy over \(\texttt{kappa}\). Plotting the marginal data density again, we can see the model still looks well calibrated.

y_rep <- fit_hier_NB_slopes$draws("y_rep", format = "matrix")
ppc_dens_overlay(
  y = standata_hier_long$complaints,
  yrep = y_rep[1:200,]
)

Check quantiles of predictions:

q90 <- function(x) quantile(x, probs = 0.9)
ppc_stat(
  y = standata_hier_long$complaints,
  yrep = y_rep,
  stat = "q90"
)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Time varying effects and structured priors

We haven’t yet looked at how cockroach complaints change over time. Each panel in the plot below is a month, and we can see that we’re making essentially the same predictions for every month because we haven’t encoded any time structure into the model:

y_rep <- fit_hier_NB_slopes$draws("y_rep", format = "matrix")
select_vec <- which(standata_hier_long$mo_idx %in% 1:12)
ppc_stat_grouped(
  y = standata_hier_long$complaints[select_vec],
  yrep = y_rep[,select_vec],
  group = standata_hier_long$mo_idx[select_vec],
  stat = 'mean'
) + xlim(0, 11)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

To address this we might augment our model with a log-additive monthly effect, \(\texttt{mo}_t\).

\[ \eta_{b,t} = \mu_b + \kappa_b \, \texttt{traps}_{b,t} + \texttt{mo}_t + \text{log_sq_foot}_b \]

We have complete freedom over how to specify the prior for \(\texttt{mo}_t\). There are several competing factors for how the number of complaints might change over time. It makes sense that there might be more roaches in the environment during the summer, but we might also expect that there is more roach control in the summer as well. Given that we’re modeling complaints, maybe after the first sighting of roaches in a building, residents are more vigilant, and thus complaints of roaches would increase.

This can be a motivation for using an autoregressive prior for our monthly effects. The model is as follows:

\[ \texttt{mo}_t | \texttt{mo}_{t-1} \sim \text{Normal}(\rho \, \texttt{mo}_{t-1}, \sigma_\texttt{mo}) \\ \equiv \\ \texttt{mo}_t = \rho \, \texttt{mo}_{t-1} +\epsilon_t, \quad \epsilon_t \sim \text{Normal}(0, \sigma_\texttt{mo}) \\ \quad \rho \in (-1,1) \]

This equation says that the monthly effect in month \(t\) is directly related to the last month’s monthly effect. Given the description of the process above, it seems like there could be either positive or negative associations between the months, but there should be a bit more weight placed on positive \(\rho\)s, so we’ll put an informative prior that pushes the parameter \(\rho\) towards 0.5.

Before we write our prior, however, we have a problem: Stan doesn’t implement any densities that have support on \([-1,1]\). We can instead transform a raw variable defined on \([0,1]\) to obtain a density on \([-1,1]\). Specifically,

\[ \rho_{\text{raw}} \in [0, 1] \\ \rho = 2 \times \rho_{\text{raw}} - 1 \]

Then we can use a beta prior on \(\rho_\text{raw}\) to push our estimate of \(\rho\) towards 0.5.

For example a rho_raw ~ beta(10, 5) prior would result in the following:

rho_raw <- rbeta(1000, 10, 5)
rho <- 2 * rho_raw - 1
rho_priors <- cbind(rho_raw, rho)
mcmc_hist(rho_priors, facet_args = list(nrow = 2)) + xlim(-1, 1)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

One further wrinkle is that we have a prior for \(\texttt{mo}_t\) that depends on \(\texttt{mo}_{t-1}\). That is, we are working with the conditional distribution of \(\texttt{mo}_t\) given \(\texttt{mo}_{t-1}\). But what should we do about the prior for \(\texttt{mo}_1\), for which we don’t have a previous time period in the data?

We need to work out the marginal distribution of the first observation. Thankfully we can use the fact that AR models are stationary, so \(\text{Var}(\texttt{mo}_t) = \text{Var}(\texttt{mo}_{t-1})\) and \(\mathbb{E}(\texttt{mo}_t) = \mathbb{E}(\texttt{mo}_{t-1})\) for all \(t\). Therefore the marginal distribution of \(\texttt{mo}_1\) is the same as the marginal distribution of any \(\texttt{mo}_t\).

First we derive the marginal variance of \(\texttt{mo}_{t}\).

\[ \text{Var}(\texttt{mo}_t) = \text{Var}(\rho \, \texttt{mo}_{t-1} + \epsilon_t) \\ \text{Var}(\texttt{mo}_t) = \text{Var}(\rho \, \texttt{mo}_{t-1}) + \text{Var}(\epsilon_t) \] where the second line holds by independence of \(\epsilon_t\) and \(\texttt{mo}_{t-1}\). Then, using the fact that \(\text{Var}(cX) = c^2 \text{Var}(X)\) for a constant \(c\) and the fact that, by stationarity, \(\text{Var}(\texttt{mo}_{t-1}) = \text{Var}(\texttt{mo}_{t})\), we then obtain:

\[ \text{Var}(\texttt{mo}_t)= \rho^2 \text{Var}( \texttt{mo}_{t}) + \sigma_\texttt{mo}^2 \\ \text{Var}(\texttt{mo}_t) = \frac{\sigma_\texttt{mo}^2}{1 - \rho^2} \]

For the mean of \(\texttt{mo}_t\) things are a bit simpler:

\[ \mathbb{E}(\texttt{mo}_t) = \mathbb{E}(\rho \, \texttt{mo}_{t-1} + \epsilon_t) \\ \mathbb{E}(\texttt{mo}_t) = \mathbb{E}(\rho \, \texttt{mo}_{t-1}) + \mathbb{E}(\epsilon_t) \\ \]

Since \(\mathbb{E}(\epsilon_t) = 0\) by assumption we have

\[ \mathbb{E}(\texttt{mo}_t) = \mathbb{E}(\rho \, \texttt{mo}_{t-1}) + 0\\ \mathbb{E}(\texttt{mo}_t) = \rho \, \mathbb{E}(\texttt{mo}_{t}) \\ \mathbb{E}(\texttt{mo}_t) - \rho \mathbb{E}(\texttt{mo}_t) = 0 \\ \mathbb{E}(\texttt{mo}_t) = 0/(1 - \rho) \]

which for \(\rho \neq 1\) yields \(\mathbb{E}(\texttt{mo}_{t}) = 0\).

We now have the marginal distribution for \(\texttt{mo}_{t}\), which, in our case, we will use for \(\texttt{mo}_1\). The full AR(1) specification is then:

\[ \texttt{mo}_1 \sim \text{Normal}\left(0, \frac{\sigma_\texttt{mo}}{\sqrt{1 - \rho^2}}\right) \\ \texttt{mo}_t | \texttt{mo}_{t-1} \sim \text{Normal}\left(\rho \, \texttt{mo}_{t-1}, \sigma_\texttt{mo}\right) \forall t > 1 \]
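To convince ourselves that this specification behaves as derived, we can simulate a long AR(1) series in R, drawing the first value from the marginal distribution and later values from the conditional distribution, and check that the empirical variance matches the stationary variance \(\sigma_\texttt{mo}^2 / (1 - \rho^2)\). The values of rho and sigma_mo below are arbitrary, chosen only for illustration:

```r
set.seed(5678)
rho <- 0.5       # arbitrary illustrative values, not estimates from our model
sigma_mo <- 1
T <- 1e5

mo <- numeric(T)
# first value drawn from the stationary (marginal) distribution
mo[1] <- rnorm(1, 0, sigma_mo / sqrt(1 - rho^2))
# all later values drawn from the conditional AR(1) distribution
for (t in 2:T) mo[t] <- rho * mo[t - 1] + rnorm(1, 0, sigma_mo)

# empirical variance should be close to sigma_mo^2 / (1 - rho^2)
c(empirical_var = var(mo), stationary_var = sigma_mo^2 / (1 - rho^2))
```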

Let’s compile and fit the model.

comp_hier_NB_mos <- cmdstan_model("stan_programs/hier_NB_regression_ncp_slopes_mos.stan")
fit_hier_NB_mos <-
  comp_hier_NB_mos$sample(
    data = standata_hier_long,
    chains = 4,
    parallel_chains = 4,
    adapt_delta = 0.95, 
    refresh = 200, 
    show_exceptions = FALSE
  )
Warning: 6 of 4000 (0.0%) transitions ended with a divergence.
See https://mc-stan.org/misc/warnings for details.
Running MCMC with 4 parallel chains...

Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 2 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 3 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 4 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 4 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 2 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 3 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 4 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 2 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 3 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 1 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 4 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 2 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 3 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 1 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 4 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 2 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 3 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 1 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 4 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 4 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 2 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 2 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 3 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 3 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 4 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 2 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 3 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 4 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 1 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 2 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 3 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 4 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 1 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 2 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 4 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 3 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 1 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 2 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 4 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 4 finished in 23.6 seconds.
Chain 1 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 3 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 2 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 2 finished in 25.5 seconds.
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 1 finished in 26.4 seconds.
Chain 3 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 3 finished in 26.5 seconds.

All 4 chains finished successfully.
Mean chain execution time: 25.5 seconds.
Total execution time: 26.8 seconds.

fit_hier_NB_mos$summary(variables = c("rho", "sigma_mo", "mo"))
# A tibble: 38 × 10
   variable   mean median     sd    mad     q5    q95  rhat ess_bulk ess_tail
   <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl>    <dbl>    <dbl>
 1 rho       0.781  0.787 0.0786 0.0776  0.637  0.897  1.01    1204.    1507.
 2 sigma_mo  0.572  0.564 0.0960 0.0911  0.433  0.744  1.00    1162.    2022.
 3 mo[1]    -2.32  -2.29  0.579  0.558  -3.33  -1.44   1.00    1984.    2742.
 4 mo[2]    -1.71  -1.68  0.500  0.478  -2.55  -0.939  1.00    1795.    2553.
 5 mo[3]    -1.87  -1.84  0.517  0.494  -2.75  -1.07   1.00    1813.    2479.
 6 mo[4]    -1.57  -1.55  0.507  0.473  -2.44  -0.778  1.00    1871.    2609.
 7 mo[5]    -1.78  -1.76  0.509  0.497  -2.65  -0.990  1.00    1907.    2634.
 8 mo[6]    -1.54  -1.52  0.500  0.482  -2.40  -0.741  1.00    1748.    2203.
 9 mo[7]    -1.64  -1.61  0.499  0.488  -2.50  -0.846  1.00    1757.    2637.
10 mo[8]    -1.00  -0.985 0.467  0.452  -1.80  -0.271  1.00    1654.    2360.
# ℹ 28 more rows

In the interest of brevity, we won’t go on expanding the model, though we certainly could. What other information would help us understand the data generating process better? What other aspects of the data generating process might we want to capture that we’re not capturing now?

As usual, we run through our posterior predictive checks.

y_rep <- fit_hier_NB_mos$draws("y_rep", format = "matrix")
ppc_dens_overlay(
  y = standata_hier_long$complaints,
  yrep = y_rep[1:200,]
)

select_vec <- which(standata_hier_long$mo_idx %in% 1:12)
ppc_stat_grouped(
  y = standata_hier_long$complaints[select_vec],
  yrep = y_rep[, select_vec],
  group = standata_hier_long$mo_idx[select_vec],
  stat = 'mean'
)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As we can see, our monthly random intercept has captured a monthly pattern across all the buildings. We can also compare the prior and posterior for the autoregressive parameter to see how much we’ve learned.

rho_draws <- cbind(
  prior = 2 * rbeta(4000, 10, 5) - 1,
  posterior = as.vector(fit_hier_NB_mos$draws("rho", format = "matrix"))
)
mcmc_hist(
  rho_draws,
  freq = FALSE,
  binwidth = 0.025,
  facet_args = list(nrow = 2)
) + xlim(-1, 1)

We’ll also make the intervals plot that shows the posterior predictive distribution of complaints for different numbers of traps:

ppc_intervals(
  y = standata_hier_long$complaints,
  yrep = y_rep,
  x = standata_hier_long$traps + rnorm(standata_hier_long$N, 0, 0.2)
) +
  labs(
    x = "Number of traps (jittered)", 
    y = "Predicted number of complaints"
  )

It looks as if our model finally generates a reasonable posterior predictive distribution for all numbers of traps, and appropriately captures the tails of the data generating process.

We can also plot the mo parameters directly and look at the estimated latent AR(1) trend. This isn’t a PPC; we’re just looking at the posterior distribution of mo, but we can trick the PPC plotting functions into working for this too:

# this is a bit of an abuse of the ppc functions:
# instead of data i'm setting y equal to the posterior means of mo
# instead of yrep i'm setting  yrep equal to the matrix of mo draws
ppc_ribbon(
  y = fit_hier_NB_mos$summary(variables = "mo", "mean")$mean,
  yrep = fit_hier_NB_mos$draws("mo", format = "matrix"),
  prob = 0.5,
  prob_outer = 0.9
) + 
  legend_none() + 
  labs(x = "Month", y = "Latent AR(1)")

Using our model: forecasts for decision making

Our model seems to be fitting well, so now we will go ahead and use the model to help us make a decision about how many traps to put in our buildings. This decision will balance the benefit of using more traps (fewer complaints and calls to the exterminator) with the cost of setting up and maintaining the traps.

Remember our problem definition, formalized above with the notation:

\[ \arg\max_{\textrm{traps} \in \mathbb{N}} \mathbb{E}_{\text{complaints}}[-L(\textrm{complaints}(\textrm{traps})) - \textrm{C}(\textrm{traps})] \] where \(L\) is the loss associated with a number of complaints, which depends on the number of traps set, and \(C\) is the cost of installing and maintaining the traps, which also depends on the number of traps set. Let’s define a new term, \(u(\textrm{traps})\), \[ u(\textrm{traps}) = -L(\textrm{complaints}(\textrm{traps})) - \textrm{C}(\textrm{traps}) \] where \(u\) stands for utility.

Then the problem is: \[ \arg\max_{\textrm{traps} \in \mathbb{N}} \mathbb{E}_{\text{complaints}}[u(\textrm{traps})] \] We’ve done nothing but change the name of the term over which we’re taking the expectation. We’ve done so for clarity of exposition; instead of couching our problem in terms of maximizing expected negative losses and costs, we can frame our problem in terms of maximizing the expected benefits.
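As a toy illustration of this decision rule: once we have draws of the utility for each candidate number of traps, the decision is just an argmax over Monte Carlo estimates of the expected utility. The complaint-rate curve and dollar figures below are fabricated purely for illustration and are not our fitted model:

```r
set.seed(42)
traps_options <- 0:20

# fabricated example: complaints fall off exponentially in the number of
# traps, each complaint costs $10, each trap costs $20
utility_draws <- sapply(traps_options, function(k)
  -10 * rpois(1e4, lambda = 40 * exp(-0.2 * k)) - 20 * k)

# Monte Carlo estimate of E[u(traps)], then pick the maximizing traps value
expected_utility <- colMeans(utility_draws)
traps_options[which.max(expected_utility)]
```

The same pattern, with utility draws coming from the posterior predictive distribution instead of a made-up Poisson curve, is what we do with the real model below.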

We will generate forecasts of complaints for the buildings for a period of 12 months, which will let us quantify our uncertainty about the complaints, and ultimately utility, projected for any hypothetical number of traps we could use in each building. That is, we will ask the question: over the next year, how much would it benefit each building if they used 0 traps, or 1 trap, or 2 traps, etc.?

To do this we first need to predict the number of complaints over 12 months in each building for each hypothetical number of traps (we’ll limit it to a max of 20 traps).

# this is the same model as the previous one just with extra code to
# do the future forecasts
comp_forecast <- cmdstan_model("stan_programs/hier_NB_regression_ncp_slopes_mos_predict.stan")
# add number of months to predict and hypothetical numbers of traps to the data list
standata_hier_long$M_forward <- 12
standata_hier_long$hypo_traps <- 0:20
standata_hier_long$N_hypo_traps <- length(standata_hier_long$hypo_traps)

# we'll re-estimate the posterior because it's fast here
# but you can just run generated quantities if you want (and use posterior from previous model)
# https://mc-stan.org/cmdstanr/reference/model-method-generate-quantities.html
fit_forecast <- comp_forecast$sample(
  data = standata_hier_long,
  chains = 4, 
  parallel_chains = 4,
  adapt_delta = 0.95,
  refresh = 250,
  show_exceptions = FALSE
)
Running MCMC with 4 parallel chains...

Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 2 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 3 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 4 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 2 Iteration:  250 / 2000 [ 12%]  (Warmup) 
Chain 1 Iteration:  250 / 2000 [ 12%]  (Warmup) 
Chain 4 Iteration:  250 / 2000 [ 12%]  (Warmup) 
Chain 3 Iteration:  250 / 2000 [ 12%]  (Warmup) 
Chain 2 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 4 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 1 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 3 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 2 Iteration:  750 / 2000 [ 37%]  (Warmup) 
Chain 4 Iteration:  750 / 2000 [ 37%]  (Warmup) 
Chain 1 Iteration:  750 / 2000 [ 37%]  (Warmup) 
Chain 3 Iteration:  750 / 2000 [ 37%]  (Warmup) 
Chain 2 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 2 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 4 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 4 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 3 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 3 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 2 Iteration: 1250 / 2000 [ 62%]  (Sampling) 
Chain 4 Iteration: 1250 / 2000 [ 62%]  (Sampling) 
Chain 1 Iteration: 1250 / 2000 [ 62%]  (Sampling) 
Chain 3 Iteration: 1250 / 2000 [ 62%]  (Sampling) 
Chain 4 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 2 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 1 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 3 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 4 Iteration: 1750 / 2000 [ 87%]  (Sampling) 
Chain 1 Iteration: 1750 / 2000 [ 87%]  (Sampling) 
Chain 2 Iteration: 1750 / 2000 [ 87%]  (Sampling) 
Chain 4 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 4 finished in 23.5 seconds.
Chain 3 Iteration: 1750 / 2000 [ 87%]  (Sampling) 
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 1 finished in 24.5 seconds.
Chain 2 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 2 finished in 25.3 seconds.
Chain 3 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 3 finished in 26.3 seconds.

All 4 chains finished successfully.
Mean chain execution time: 24.9 seconds.
Total execution time: 26.7 seconds.
Warning: 1 of 4000 (0.0%) transitions ended with a divergence.
See https://mc-stan.org/misc/warnings for details.

# use new draws_rvars format in posterior package
# http://mc-stan.org/posterior/articles/rvar.html

# get draws of predicted complaints
draws <- as_draws_rvars(fit_forecast$draws("y_pred"))
predicted_complaints <- draws$y_pred

# draws (hidden) x buildings x number of traps
dim(predicted_complaints)
[1] 10 21
dim(draws_of(predicted_complaints))
[1] 4000   10   21

Factoring in the various costs for the client

Losses due to number of complaints

The client has a policy that for every 10 complaints, they’ll call an exterminator costing them $100, so that amounts to $10 per complaint:

# this is a bit of an approximation because complaints don't have to 
# be a multiple of 10, but this is close enough. we could do a little more
# work to get it exact if we want
losses <- -10 * predicted_complaints
dimnames(losses) <- list("building" = NULL, "traps" = NULL)

For the purposes of this example we omit the potential losses due to tenants leaving the building if the number of complaints is high enough, or new tenants not wanting to rent in the building if the complaints become public (e.g., in online reviews).

Cost of materials and labor for maintaining the traps

Another key input to our analysis will be the cost of installing traps (really bait stations). We’re simulating the number of complaints we receive over the course of a year, so we need to understand the cost associated with maintaining each bait station over the course of a year. There’s the cost attributed to the raw bait station, which is the plastic housing and the bait material, a peanut-buttery substance that’s injected with insecticide. The cost of maintaining one bait station for a year plus monthly replenishment of the bait material is about $20.

N_traps <- 20
costs <- 20 * 0:N_traps

We’ll also need labor for maintaining the bait stations, which need to be serviced every two months. If there are fewer than five traps, our in-house maintenance staff can manage the stations (about one hour of work every two months at $20/hour), but with five or more traps we need to hire outside pest control to help out. They’re a bit more expensive, so we’ve put their cost at $30/hour. Every additional five traps require an extra person-hour of work, so that’s factored in as well; the marginal person-hours above five traps are billed at the higher pest-control labor rate.

N_months_labor <- standata_hier_long$M_forward / 2
hourly_rate_low <- 20
hourly_rate_high <- 30
costs <- costs +
  (0:N_traps < 5 & 0:N_traps > 0) * (N_months_labor * hourly_rate_low) +
  (0:N_traps >= 5 & 0:N_traps < 10) * (N_months_labor * (hourly_rate_low + 1 * hourly_rate_high)) +
  (0:N_traps >= 10 & 0:N_traps < 15) * (N_months_labor * (hourly_rate_low + 2 * hourly_rate_high)) +
  (0:N_traps >= 15) * (N_months_labor * (hourly_rate_low + 3 * hourly_rate_high))
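As a quick sanity check on the schedule above, here is a self-contained version with an assumed value of N_months_labor (in the analysis it comes from standata_hier_long$M_forward / 2; here we assume a 12-month horizon, i.e., 6 service visits):

```r
# Self-contained check of the cost schedule, assuming 6 service visits
# (N_months_labor = 6 is an assumption for illustration)
N_traps <- 20
N_months_labor <- 6
hourly_rate_low <- 20
hourly_rate_high <- 30
costs <- 20 * 0:N_traps +
  (0:N_traps < 5 & 0:N_traps > 0) * (N_months_labor * hourly_rate_low) +
  (0:N_traps >= 5 & 0:N_traps < 10) * (N_months_labor * (hourly_rate_low + 1 * hourly_rate_high)) +
  (0:N_traps >= 10 & 0:N_traps < 15) * (N_months_labor * (hourly_rate_low + 2 * hourly_rate_high)) +
  (0:N_traps >= 15) * (N_months_labor * (hourly_rate_low + 3 * hourly_rate_high))

# 0 traps cost nothing; 4 traps: materials + in-house labor only;
# 5 traps: outside pest control kicks in; costs is indexed from 1, traps from 0
costs[c(0, 4, 5, 10) + 1]  # 0 200 400 680
```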

ggplot(data.frame(traps = 0:N_traps, cost = costs), 
       aes(x = traps, y = cost)) + 
  geom_point() + 
  geom_line()

We can now fold the materials and labor costs into the total utility by subtracting them from the losses we already have:

utility <- sweep(losses, MARGIN = 2, STATS = costs, FUN = '-')
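To see what sweep() is doing here, consider a minimal example with a plain (buildings × traps) matrix and made-up numbers: the per-trap-count cost vector is subtracted from every row.

```r
# sweep() subtracts the column-wise costs from every row of the losses matrix
toy_losses <- matrix(c(-10, -20, -30,
                       -40, -50, -60), nrow = 2, byrow = TRUE)
toy_costs <- c(0, 20, 40)  # illustrative costs for 0, 1, 2 traps
toy_utility <- sweep(toy_losses, MARGIN = 2, STATS = toy_costs, FUN = '-')
toy_utility
#      [,1] [,2] [,3]
# [1,]  -10  -40  -70
# [2,]  -40  -70 -100
```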

Plotting forecasts

We can now plot curves with number of traps on the x-axis and utility forecasts and uncertainty intervals on the y-axis.

# calculate means and central intervals (for example 90% intervals)
mean_utility <- t(apply(utility, c(1, 2), mean))
lower_utility <- t(apply(utility, c(1, 2), quantile, 0.05))
upper_utility <- t(apply(utility, c(1, 2), quantile, 0.95))

buildings <- seq(1, ncol(mean_utility))
traps <- seq(0, nrow(mean_utility) - 1)

utility_df <-
  data.frame(
    expected_utility = as.vector(mean_utility),
    lower = as.vector(lower_utility),
    upper = as.vector(upper_utility),
    traps = rep(traps, times = max(buildings)),
    building = rep(buildings, each = max(traps) + 1)
  )

forecast_plot <- ggplot(data = utility_df, aes(x = traps, y = expected_utility)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "skyblue2") +
  geom_line() +
  labs(x = "Number of traps", y = "Utility ($)") +
  facet_wrap(~ building, scales = "free_x", ncol = 3, labeller = label_both) + 
  theme(text = element_text(size = 18))
forecast_plot

Finally, we can overlay the answer to the decision problem and show the maximum expected utility and the corresponding number of traps to set.

decision_df <- utility_df %>%
  group_by(building) %>%
  summarise(
    arg_max = which.max(expected_utility) - 1, # the number of traps to use
    max = max(expected_utility)
  )
forecast_plot + 
  geom_vline(data = decision_df,
             mapping = aes(xintercept = arg_max),
             col = "red2", linetype = 2) +
  geom_hline(data = decision_df,
             mapping = aes(yintercept = max),
             col = "red2", linetype = 2) +
  geom_point(data = decision_df, 
             mapping = aes(x = arg_max, y = max), 
             col = "red2", size = 4)


Left as an exercise (unless we have time):

  • How would we make a forecast for a new building?

  • When maximizing expected utility, we take expectations at each trap count for each building, and choose the trap number that maximizes expected utility (revenue) for that building. This will be called a maximum revenue strategy. How can we generate the distribution of portfolio revenue (i.e. the sum of revenue across all the buildings) under the maximum revenue strategy from the predictions we already have?

Gaussian process instead of AR(1)

We won’t cover this in class, but for anyone interested in Gaussian processes (GP) this section explains how to implement a GP as an alternative to the AR(1). This is pretty math heavy and just intended to give people interested in GPs an example of how to do it. If you’re not interested in GPs feel free to ignore this.

Joint density for AR(1) process

We can derive the joint distribution for the AR(1) process before we move to the Gaussian process (GP) which will give us a little more insight into what a GP is. Remember that we’ve specified the AR(1) prior as:

\[ \begin{aligned} \texttt{mo}_1 & \sim \text{Normal}\left(0, \frac{\sigma_\texttt{mo}}{\sqrt{1 - \rho^2}}\right) \\ \texttt{mo}_t & \sim \text{Normal}\left(\rho \, \texttt{mo}_{t-1}, \sigma_\texttt{mo}\right) \quad \forall \, t > 1 \end{aligned} \] Rewriting our process in terms of the errors will make the derivation of the joint distribution clearer:

\[ \begin{aligned} \texttt{mo}_1 & \sim \text{Normal}\left(0, \frac{\sigma_\texttt{mo}}{\sqrt{1 - \rho^2}}\right) \\ \texttt{mo}_t & = \rho \, \texttt{mo}_{t-1} + \sigma_\texttt{mo}\epsilon_t \\ \epsilon_t & \sim \text{Normal}\left(0, 1\right) \end{aligned} \]

Given that our first term \(\texttt{mo}_1\) is normally distributed, and subsequent terms are sums of normal random variables, this means that jointly the vector mo is multivariate normal.

More formally, if we have a vector \(x \in \mathbb{R}^M\) which is multivariate normal, \(x \sim \text{MultiNormal}(0, \Sigma)\), and we left-multiply \(x\) by a nonsingular matrix \(L \in \mathbb{R}^{M\times M}\), then \(y = Lx \sim \text{MultiNormal}(0, L\Sigma L^T)\). We can use this fact to show that our vector mo is jointly multivariate normal.

Just as before with the noncentered parameterization, we’ll take a vector \(\texttt{mo_raw} \in \mathbb{R}^M\) (in which each element is univariate \(\text{Normal}(0,1)\)) and transform it into mo. But instead of doing this with scalar transformations like we did previously (e.g., in the section Time varying effects and structured priors) we will now do it with linear algebra operations.

The trick is that by specifying each element of \(\texttt{mo_raw}\) to be distributed \(\text{Normal}(0,1)\) we are implicitly defining \(\texttt{mo_raw} \sim \text{MultiNormal}(0, I_M)\), where \(I_M\) is the \(M \times M\) identity matrix. Then we do a linear transformation using a matrix \(L\) and assign the result to mo, i.e., \(\texttt{mo} = L\times\texttt{mo_raw}\). This gives us \(\texttt{mo} \sim \text{MultiNormal}(0, LI_M L^T)\) and \(LI_M L^T = LL^T\).
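We can verify this transformation rule numerically with an arbitrary (illustrative) lower-triangular matrix \(L\): the sample covariance of \(y = Lz\), with \(z \sim \text{MultiNormal}(0, I)\), should approach \(LL^T\).

```r
# Numerical check: if z ~ MultiNormal(0, I) then y = L z ~ MultiNormal(0, L L^T)
# (the matrix L below is arbitrary, chosen only for illustration)
set.seed(42)
L <- matrix(c(2,   0, 0,
              1,   3, 0,
              0.5, 1, 1), nrow = 3, byrow = TRUE)
Z <- matrix(rnorm(3 * 4e5), nrow = 3)  # each column is one draw of z
Y <- L %*% Z
max(abs(cov(t(Y)) - L %*% t(L)))  # small Monte Carlo error
```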

Consider the case where we have three elements in mo and we want to figure out the form for \(L\).

The first element of mo is fairly straightforward, because it mirrors our earlier parameterization of the AR(1) prior. The only difference is that we’re explicitly adding the last two terms of mo_raw into the equation so we can use matrix algebra for our transformation. \[ \texttt{mo}_1 = \frac{\sigma_{\texttt{mo}}}{\sqrt{1 - \rho^2}} \times \texttt{mo_raw}_1 + 0 \times \texttt{mo_raw}_2 + 0 \times \texttt{mo_raw}_3\\ \] The second element is a bit more complicated:

\[ \begin{aligned} \texttt{mo}_2 & = \rho \texttt{mo}_1 + \sigma_{\texttt{mo}}\,\texttt{mo_raw}_2 + 0 \times \texttt{mo_raw}_3 \\ & = \rho \left(\frac{\sigma_{\texttt{mo}}}{\sqrt{1 - \rho^2}} \times \texttt{mo_raw}_1\right) + \sigma_{\texttt{mo}}\,\texttt{mo_raw}_2 + 0 \times \texttt{mo_raw}_3 \\[5pt] & = \frac{\rho \sigma_{\texttt{mo}}}{\sqrt{1 - \rho^2}} \times \texttt{mo_raw}_1 + \sigma_{\texttt{mo}}\,\texttt{mo_raw}_2 + 0 \times \texttt{mo_raw}_3 \\ \end{aligned} \]

While the third element will involve all three terms \[ \begin{aligned} \texttt{mo}_3 & = \rho \, \texttt{mo}_2 + \sigma_{\texttt{mo}}\,\texttt{mo_raw}_3 \\ & = \rho \left(\frac{\rho \sigma_{\texttt{mo}}}{\sqrt{1 - \rho^2}} \times \texttt{mo_raw}_1 + \sigma_{\texttt{mo}}\,\texttt{mo_raw}_2\right) + \sigma_{\texttt{mo}} \texttt{mo_raw}_3 \\[5pt] & = \frac{\rho^2 \sigma_{\texttt{mo}}}{\sqrt{1 - \rho^2}} \times \texttt{mo_raw}_1 + \rho \, \sigma_{\texttt{mo}}\,\texttt{mo_raw}_2 + \sigma_{\texttt{mo}}\,\texttt{mo_raw}_3 \\ \end{aligned} \]

Writing this all together:

\[ \begin{aligned} \texttt{mo}_1 & = \frac{\sigma_{\texttt{mo}}}{\sqrt{1 - \rho^2}} \times \texttt{mo_raw}_1 + 0 \times \texttt{mo_raw}_2 + 0 \times \texttt{mo_raw}_3\\[3pt] \texttt{mo}_2 & = \frac{\rho \sigma_{\texttt{mo}}}{\sqrt{1 - \rho^2}} \times \texttt{mo_raw}_1 + \sigma_{\texttt{mo}}\,\texttt{mo_raw}_2 + 0 \times \texttt{mo_raw}_3 \\[3pt] \texttt{mo}_3 & = \frac{\rho^2 \sigma_{\texttt{mo}}}{\sqrt{1 - \rho^2}} \times \texttt{mo_raw}_1 + \rho \, \sigma_{\texttt{mo}}\,\texttt{mo_raw}_2 + \sigma_{\texttt{mo}}\,\texttt{mo_raw}_3 \\ \end{aligned} \] Separating this into a matrix of coefficients \(L\) and the vector mo_raw:

\[ \texttt{mo} = \begin{bmatrix} \sigma_\texttt{mo} / \sqrt{1 - \rho^2} & 0 & 0 \\ \rho \sigma_\texttt{mo} / \sqrt{1 - \rho^2} & \sigma_\texttt{mo} & 0 \\ \rho^2 \sigma_\texttt{mo} / \sqrt{1 - \rho^2} & \rho \,\sigma_\texttt{mo} & \sigma_\texttt{mo} \end{bmatrix} \times \texttt{mo_raw} \]

If we multiply \(L\) on the right by its transpose \(L^T\), we’ll get expressions for the covariance matrix of our multivariate random vector mo:

\[ \begin{bmatrix} \sigma_\texttt{mo} / \sqrt{1 - \rho^2} & 0 & 0 \\ \rho \sigma_\texttt{mo} / \sqrt{1 - \rho^2} & \sigma_\texttt{mo} & 0 \\ \rho^2 \sigma_\texttt{mo} / \sqrt{1 - \rho^2} & \rho \,\sigma_\texttt{mo} & \sigma_\texttt{mo} \end{bmatrix} \times \begin{bmatrix} \sigma_\texttt{mo} / \sqrt{1 - \rho^2} & \rho \sigma_\texttt{mo} / \sqrt{1 - \rho^2} & \rho^2 \sigma_\texttt{mo} / \sqrt{1 - \rho^2} \\ 0 & \sigma_\texttt{mo} & \rho \,\sigma_\texttt{mo} \\ 0 & 0 & \sigma_\texttt{mo} \end{bmatrix} \]

which results in:

\[ \begin{bmatrix} \sigma^2_\texttt{mo} / (1 - \rho^2) & \rho \, \sigma^2_\texttt{mo} / (1 - \rho^2) & \rho^2 \, \sigma^2_\texttt{mo} / (1 - \rho^2)\\ \rho \, \sigma^2_\texttt{mo} / (1 - \rho^2) & \sigma^2_\texttt{mo} / (1 - \rho^2) & \rho \, \sigma^2_\texttt{mo} / (1 - \rho^2) \\ \rho^2 \sigma^2_\texttt{mo} / (1 - \rho^2) & \rho \, \sigma^2_\texttt{mo} / (1 - \rho^2) & \sigma^2_\texttt{mo} / (1 - \rho^2) \end{bmatrix} \] We can simplify this result by dividing the matrix by \(\sigma^2_\texttt{mo} / (1 - \rho^2)\) to get

\[ \begin{bmatrix} 1 & \rho & \rho^2 \\ \rho & 1 & \rho \\ \rho^2 & \rho & 1 \end{bmatrix} \]

This should generalize to higher dimensions pretty easily.

We can replace the Stan code defining mo in stan_programs/hier_NB_regression_ncp_slopes_mos.stan with the following:

vector[M] mo;
{
  matrix[M,M] A = rep_matrix(0, M, M);
  A[1,1] = sigma_mo / sqrt(1 - rho^2);
  for (m in 2:M) {
    A[m,1] = rho^(m-1) * sigma_mo / sqrt(1 - rho^2);
  }
  for (m in 2:M) {
    A[m,m] = sigma_mo;
    for (i in (m + 1):M) {
      A[i,m] = rho^(i-m) * sigma_mo;
    }
  }
  mo = A * mo_raw;
}
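Before trusting the Stan code, we can reproduce the same construction in R (with illustrative values of rho and sigma_mo) and confirm that \(AA^T\) matches the AR(1) covariance matrix we derived:

```r
# Build the same lower-triangular matrix A in R and confirm that A A^T
# reproduces the AR(1) covariance rho^|i-j| * sigma_mo^2 / (1 - rho^2)
rho <- 0.5; sigma_mo <- 1; M <- 3  # illustrative values
A <- matrix(0, M, M)
A[, 1] <- rho^(0:(M - 1)) * sigma_mo / sqrt(1 - rho^2)
for (m in 2:M) A[m:M, m] <- rho^(0:(M - m)) * sigma_mo
Sigma <- sigma_mo^2 / (1 - rho^2) * rho^abs(outer(1:M, 1:M, "-"))
all.equal(A %*% t(A), Sigma)  # TRUE
```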

Cholesky decomposition

If we only knew the covariance matrix of our process, say a matrix called \(\Sigma\), and we had a way of decomposing \(\Sigma\) into \(L L^T\), we wouldn’t need to write out the equation for the vector. Luckily, there is a matrix decomposition called the Cholesky decomposition that does just that. The Stan function for the decomposition is called cholesky_decompose. Instead of writing out the explicit equation we can just do the following

vector[M] mo;
{
  mo = cholesky_decompose(Sigma) * mo_raw;
}

provided we’ve defined Sigma appropriately elsewhere in the transformed parameters block. Note that the matrix \(L\) is lower-triangular (i.e., all elements above the diagonal are zero).

We’ve already derived the covariance matrix Sigma for the three-dimensional AR(1) process above by explicitly calculating \(L L^T\), but we can also derive it directly from the rules of covariance and the way our process is defined. We already know that each element \(\texttt{mo}_t\) has marginal variance \(\sigma^2_\texttt{mo} / (1- \rho^2)\), but we don’t know the covariance of \(\texttt{mo}_t\) and \(\texttt{mo}_{t+h}\). We can derive it recursively. First we derive the covariance for two elements of mo separated by one month:

\[ \text{Cov}(\texttt{mo}_{t+1},\texttt{mo}_{t}) = \text{Cov}(\rho \, \texttt{mo}_{t} + \sigma_\texttt{mo}\epsilon_{t+1},\texttt{mo}_{t}) \\ \text{Cov}(\texttt{mo}_{t+1},\texttt{mo}_{t}) = \rho \text{Cov}(\texttt{mo}_{t},\texttt{mo}_{t}) + \sigma_\texttt{mo}\text{Cov}(\epsilon_{t+1},\texttt{mo}_{t}) \\ \text{Cov}(\texttt{mo}_{t+1},\texttt{mo}_{t}) = \rho \text{Var}(\texttt{mo}_t) + 0 \]

Then we define the covariance for \(\text{Cov}(\texttt{mo}_{t+h},\texttt{mo}_{t})\) in terms of \(\text{Cov}(\texttt{mo}_{t+h-1},\texttt{mo}_{t})\)

\[ \text{Cov}(\texttt{mo}_{t+h},\texttt{mo}_{t}) = \text{Cov}(\rho \, \texttt{mo}_{t+h-1} + \sigma_\texttt{mo}\epsilon_{t+h},\texttt{mo}_{t}) \\ \text{Cov}(\texttt{mo}_{t+h},\texttt{mo}_{t}) = \rho \, \text{Cov}(\texttt{mo}_{t+h-1},\texttt{mo}_{t}) \\ \] Which we can use to recursively get at the covariance we need:

\[ \text{Cov}(\texttt{mo}_{t+h},\texttt{mo}_{t}) = \rho \, \text{Cov}(\texttt{mo}_{t+h-1},\texttt{mo}_{t}) \\ \text{Cov}(\texttt{mo}_{t+h},\texttt{mo}_{t}) = \rho \,( \rho \, \text{Cov}(\texttt{mo}_{t+h-2},\texttt{mo}_{t}) )\\ \dots \\ \text{Cov}(\texttt{mo}_{t+h},\texttt{mo}_{t}) = \rho^h \, \text{Cov}(\texttt{mo}_{t},\texttt{mo}_{t}) \\ \text{Cov}(\texttt{mo}_{t+h},\texttt{mo}_{t}) = \rho^h \, \sigma_\texttt{mo}^2/(1 - \rho^2) \\ \]
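We can also sanity-check this lag-\(h\) covariance formula by simulating a long stationary AR(1) path (parameter values below are illustrative):

```r
# Monte Carlo check of Cov(mo_{t+h}, mo_t) = rho^h * sigma^2 / (1 - rho^2)
set.seed(1)
rho <- 0.8; sigma <- 1; n <- 5e5
mo <- numeric(n)
mo[1] <- rnorm(1, 0, sigma / sqrt(1 - rho^2))  # draw from the stationary distribution
for (i in 2:n) mo[i] <- rho * mo[i - 1] + rnorm(1, 0, sigma)
h <- 3
c(empirical   = cov(mo[1:(n - h)], mo[(1 + h):n]),
  theoretical = rho^h * sigma^2 / (1 - rho^2))
```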

One way to write this in Stan code is:

vector[M] mo;
{
  matrix[M,M] Sigma;
  for (m in 1:M) {
    Sigma[m,m] = 1.0;
    for (i in (m + 1):M) {
      Sigma[i,m] = rho^(i - m);
      Sigma[m,i] = Sigma[i,m];
    }
  }
  Sigma = Sigma * sigma_mo^2 / (1 - rho^2);
  mo = cholesky_decompose(Sigma) * mo_raw;
}
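The same check works in R, where chol() plays the role of cholesky_decompose (note that R returns the upper-triangular factor, so we transpose to get the lower-triangular \(L\) used in Stan; values below are illustrative):

```r
# Build the AR(1) covariance matrix, Cholesky-decompose it, and confirm
# that L L^T recovers Sigma
rho <- 0.6; sigma_mo <- 2; M <- 5  # illustrative values
Sigma <- sigma_mo^2 / (1 - rho^2) * rho^abs(outer(1:M, 1:M, "-"))
L <- t(chol(Sigma))  # chol() returns the upper factor, so transpose
all.equal(L %*% t(L), Sigma)  # TRUE
```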

Extension to Gaussian processes

The prior we defined for mo is, strictly speaking, a Gaussian process: a stochastic process whose values at any finite set of \(M\) points are jointly multivariate normal. Formally, we could write the above prior for mo like so:

\[ \begin{aligned} \sigma_\texttt{mo} & \sim \text{Normal}(0, 1) \\ \rho & \sim \text{GenBeta}(-1,1,10, 5) \\ \texttt{mo}_t & \sim \text{GP}\left( 0, K(t | \sigma_\texttt{mo},\rho) \right) \\ \end{aligned} \] The notation \(K(t | \sigma_\texttt{mo},\rho)\) defines the covariance matrix of the process over the domain \(t\), which is months.

In other words:

\[ \text{Cov}(\texttt{mo}_t,\texttt{mo}_{t+h}) = k(t, t+h | \sigma_\texttt{mo}, \rho) \]

We’ve already derived the covariance for our process. What if we want to use a different definition of Sigma?

As the above example shows, defining a proper covariance matrix yields a proper multivariate normal prior on a parameter. So we need a way of constructing proper covariance matrices, which are symmetric positive-definite matrices. It turns out there is a class of functions, called kernel functions, that define proper covariance matrices. These functions are applied elementwise to construct a covariance matrix, \(K\):

\[ K_{[t,t+h]} = k(t, t+h | \theta) \] where \(\theta\) are the hyperparameters that define the behavior of the covariance matrix.

One such function is called the exponentiated quadratic function, and it happens to be implemented in Stan as gp_exp_quad_cov (called cov_exp_quad in older versions of Stan). The function is defined as:

\[ \begin{aligned} k(t, t+h | \theta) & = \alpha^2 \exp \left( - \dfrac{1}{2\ell^2} ((t+h) - t)^2 \right) \\ & = \alpha^2 \exp \left( - \dfrac{h^2}{2\ell^2} \right) \end{aligned} \]

The exponentiated quadratic kernel has two hyperparameters in \(\theta\): \(\alpha\), the marginal standard deviation of the stochastic process, and \(\ell\), the process length-scale.

The length-scale defines how quickly the covariance decays between time points, with large values of \(\ell\) yielding a covariance that decays slowly and small values of \(\ell\) yielding a covariance that decays rapidly. It can be interpreted as a measure of how nonlinear the mo process is in time.

The marginal standard deviation defines how large the fluctuations are on the output side, which in our case is the number of roach complaints per month across all buildings. It can be seen as a scale parameter akin to the scale parameter for our building-level hierarchical intercept, though it now defines the scale of the monthly deviations.

This kernel’s defining quality is its smoothness: the function is infinitely differentiable. That is too smooth for our example, where there is noisy month-to-month variation on top of any smooth trend, but if we add some noise to the diagonal of our covariance matrix, the model will fit well.

\[ k(t, t+h | \theta) = \alpha^2 \exp \left( - \dfrac{h^2}{2\ell^2} \right) + \begin{cases} \sigma^2_\texttt{noise} & \text{if } h = 0 \\ 0 & \text{otherwise} \end{cases} \]
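A sketch of this noisy kernel in R (the function name and parameter values below are illustrative, not taken from the Stan program). Adding \(\sigma^2_\texttt{noise}\) to the diagonal also keeps the matrix comfortably positive definite:

```r
# Exponentiated quadratic kernel with noise ("jitter") on the diagonal
exp_quad_cov <- function(x, alpha, ell, sigma_noise) {
  H <- outer(x, x, "-")  # matrix of pairwise time differences h
  alpha^2 * exp(-H^2 / (2 * ell^2)) + diag(sigma_noise^2, length(x))
}
K <- exp_quad_cov(1:36, alpha = 1, ell = 6, sigma_noise = 0.5)
# symmetric and positive definite, as a covariance matrix must be
min(eigen(K, symmetric = TRUE, only.values = TRUE)$values) > 0  # TRUE
```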

Compiling the GP model

comp_hier_NB_gp <- cmdstan_model("stan_programs/hier_NB_regression_ncp_slopes_mos_gp.stan")

Fitting the GP model to data

fit_hier_NB_gp <- comp_hier_NB_gp$sample(
  data = standata_hier_long,
  refresh = 250,
  chains = 4,
  parallel_chains = 4,
  adapt_delta = 0.9
)
Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 1 Exception: cholesky_decompose: Matrix m is not positive definite (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpYWBQea/model-eced46324372.stan', line 49, column 4 to column 32)
Chain 1 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 1 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 1 
Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 1 Exception: neg_binomial_2_log_lpmf: Precision parameter is 0, but must be positive finite! (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpYWBQea/model-eced46324372.stan', line 79, column 2 to line 80, column 66)
Chain 1 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 1 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 1 
Chain 2 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 2 Exception: gp_exp_quad_cov: sigma is 0, but must be positive! (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpYWBQea/model-eced46324372.stan', line 44, column 4 to column 66)
Chain 2 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 2 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 2 
Chain 2 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 2 Exception: gp_exp_quad_cov: length_scale is 0, but must be positive! (in '/var/folders/s0/zfzm55px2nd2v__zlw5xfj2h0000gn/T/RtmpYWBQea/model-eced46324372.stan', line 44, column 4 to column 66)
Chain 2 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 2 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Chain 2 
Running MCMC with 4 parallel chains...

Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 2 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 3 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 4 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 2 Iteration:  250 / 2000 [ 12%]  (Warmup) 
Chain 1 Iteration:  250 / 2000 [ 12%]  (Warmup) 
Chain 3 Iteration:  250 / 2000 [ 12%]  (Warmup) 
Chain 4 Iteration:  250 / 2000 [ 12%]  (Warmup) 
Chain 2 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 1 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 3 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 4 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 2 Iteration:  750 / 2000 [ 37%]  (Warmup) 
Chain 1 Iteration:  750 / 2000 [ 37%]  (Warmup) 
Chain 3 Iteration:  750 / 2000 [ 37%]  (Warmup) 
Chain 4 Iteration:  750 / 2000 [ 37%]  (Warmup) 
Chain 2 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 2 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 3 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 3 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 4 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 4 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 2 Iteration: 1250 / 2000 [ 62%]  (Sampling) 
Chain 1 Iteration: 1250 / 2000 [ 62%]  (Sampling) 
Chain 4 Iteration: 1250 / 2000 [ 62%]  (Sampling) 
Chain 2 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 3 Iteration: 1250 / 2000 [ 62%]  (Sampling) 
Chain 1 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 4 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 2 Iteration: 1750 / 2000 [ 87%]  (Sampling) 
Chain 1 Iteration: 1750 / 2000 [ 87%]  (Sampling) 
Chain 3 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 4 Iteration: 1750 / 2000 [ 87%]  (Sampling) 
Chain 2 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 2 finished in 42.7 seconds.
Chain 4 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 4 finished in 45.5 seconds.
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 1 finished in 45.9 seconds.
Chain 3 Iteration: 1750 / 2000 [ 87%]  (Sampling) 
Chain 3 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 3 finished in 53.9 seconds.

All 4 chains finished successfully.
Mean chain execution time: 47.0 seconds.
Total execution time: 54.2 seconds.

Examining the fit

Let’s look at the prior vs. posterior for the GP length scale parameter:

length_scale_draws <- cbind(
  prior = rgamma(4000, 10, 2),
  posterior = as.vector(fit_hier_NB_gp$draws('gp_len', format = "matrix"))
)
mcmc_areas(length_scale_draws)

From the plot above it looks like we only learned a small amount; however, we can see a bigger difference between the prior and the posterior if we consider how much we learned about the ratio of sigma_gp to the length-scale gp_len:

sigma_gp <- fit_hier_NB_gp$draws("sigma_gp", format = "matrix")
gp_len <- fit_hier_NB_gp$draws("gp_len", format = "matrix")

noise_to_length_scale_ratio_draws <- cbind(
  prior = abs(rnorm(4000)) / rgamma(4000, 10, 2),
  posterior = as.vector(sigma_gp / gp_len)
)
mcmc_areas(noise_to_length_scale_ratio_draws)

This is a classic problem with Gaussian processes. Marginally, the length-scale parameter isn’t very well-identified by the data, but jointly the length-scale and the marginal standard deviation are well-identified.

Now let’s compare the estimates of the time-varying parameters between the AR(1) and GP models. In this case the posterior mean of the time trend is essentially the same under the AR(1) and GP priors, but the 50% uncertainty intervals are narrower for the AR(1):

mo_ar_intervals <- mcmc_intervals_data(fit_hier_NB_mos$draws("mo"), prob = 0.5)
mo_gp_intervals <- mcmc_intervals_data(fit_hier_NB_gp$draws("gp"), prob = 0.5)
plot_data <- bind_rows(mo_ar_intervals, mo_gp_intervals)
plot_data$prior <- factor(rep(c("AR1", "GP"), each = 36), levels = c("GP", "AR1"))
plot_data$time <- rep(1:36, times = 2)

ggplot(plot_data, aes(x = time, y = m, ymin = l, ymax = h, fill = prior)) +
  geom_ribbon(alpha = 2/3)

The way we coded the GP also lets us plot a decomposition of the GP into a monthly noise component (mo_noise in the Stan code) and the underlying smoothly varying trend (gp_exp_quad in the Stan code):

# visualizing 50% intervals
mo_noise_intervals <- mcmc_intervals_data(fit_hier_NB_gp$draws("mo_noise"), prob = 0.5)
gp_exp_quad_intervals <- mcmc_intervals_data(fit_hier_NB_gp$draws("gp_exp_quad"), prob = 0.5)

plot_data <- bind_rows(mo_noise_intervals, gp_exp_quad_intervals)
plot_data$time <- rep(1:36, times = 2)
plot_data$term <- factor(rep(c("Monthly Noise", "Smooth Trend"), each = 36),
                         levels = c("Smooth Trend", "Monthly Noise"))

ggplot(plot_data, aes(x = time, y = m, ymin = l, ymax = h, fill = term)) +
  geom_ribbon(alpha = 0.5) +
  geom_line(aes(color = term), size = 0.5) +
  ylab(NULL)